This is an ever-evolving set of lecture notes for Introduction to Stochastic Processes (M362M). It should start with me explaining what stochastic processes are. Instead, here is a list of several questions you will be able to give answers to when you complete this course.
Question 1. In a simplistic model, the price of a share of a stock goes either up or down by \(\$1\) each day, each with probability \(1/2\). You own a single share whose value today is \(\$100\), so tomorrow’s price will be \(\$101\) or \(\$99\), each with probability \(1/2\), etc. Your strategy is to hold on to your share until one of the following two things happens: you go bankrupt (the stock price hits \(\$0\)), or you make a \(\$50\) profit (the stock price hits \(\$150\)).
Question 2. A person carrying a certain disease infects a random number of people in a week, and then stops being infectious. Each of the infected people transmits the disease in the same way, etc. Suppose that the number of people each (infectious) individual infects is either \(0\), \(1\), \(2\), or \(3\), each with probability \(1/4\), and that different infectious individuals may infect different numbers of people and behave independently of each other.
Question 3. In a game of tennis, Player \(1\) wins against Player \(2\) in each rally (the smallest chunk of the match that leads to a point, i.e., to a score change from \(15\)-\(30\) to \(30\)-\(30\), for example) with probability \(p\). What is the probability that Player \(1\) wins
Question 4. A knight starts in the lower left corner of the chessboard and begins moving “randomly”. That means that from any position, it chooses one of the possible (legal) moves and takes it, with all legal moves having the same probability. It keeps doing the same thing until it comes back to the square it started from.
Question 5. How does Google search work?
Learning basic R is an important part of this course, and the first order of business is to download and install an R distribution on your personal computer. We will be using RStudio as an IDE (integrated development environment). Like R itself, it is free and readily available for all major platforms. To download R to your computer, go to https://cran.rstudio.com and download the version of R for your operating system (Windows, Mac or Linux). If you are on a Mac, you want the “Latest release” which, at the time of writing, is 4.3.2, with code name “Eye Holes”. On Windows, follow the link “install R for the first time”. We are not going to do any cutting-edge stuff in this class, so an older release should be fine, too, if you happen to have it already installed on your system. Once you download the installation file (.pkg on a Mac or .exe on Windows), run it and follow the instructions. If you are running Linux, you don’t need me to tell you what to do. Once it is successfully installed, don’t run the installed app. We will use RStudio for that.
To install RStudio, go to https://posit.co/download/rstudio-desktop/, and follow the instructions under “2: Install RStudio”. After you download and install it, you are ready to run it. When it opens, you will see something like this
The part on the left is called the console and that is (one of the places) where you enter commands. Before you do, it is important to adjust a few settings. Open the options window by navigating to Tools->Global Options. In there, uncheck “Restore .RData into workspace on startup” and set “Save workspace to .RData on exit” to “Never”, as shown below:
This way, R will not pollute your environment with values you defined two weeks ago and completely forgot about. These settings are really an atavism and serve no purpose (for users like us) other than to introduce hard-to-track bugs.
There are many other settings you can play with in RStudio, but the two I mentioned above are the only ones that I really recommend setting as soon as you install it.
Finally, we need to install several R packages we will be using (mostly implicitly) during the class. First, run the following command in your console
install.packages( "tidyverse")
If R asks “Do you want to install from sources the packages which need compilation? (Yes/no/cancel)” answer no.
This will install a number of useful packages and should only take a minute or two. The next part is a bit longer, and can take up to 15 minutes if you have a slow computer/internet connection. You only have to do it once, though. Skip both steps involving tinytex below if you have LaTeX already installed on your system.
Start with
install.packages("tinytex")
followed by
tinytex::install_tinytex()
Note that if you go to the top right corner of each of the code blocks (gray boxes) containing instructions above, an icon will appear. If you click on it, it will copy the content of the box into your clipboard, and you can simply paste it into RStudio. You can do that with any code block in these notes.
Once R and RStudio are on your computer, it is time to get acquainted with the basics of R. This class is not about the finer points of R itself, and I will try to make your R experience as smooth as possible. After all, R is a tool that will help us explore and understand stochastic processes. Having said that, it is important to realize that R is a powerful programming language specifically created for statistical and probabilistic applications. Some knowledge of R is a valuable skill to have in today’s job market, and you should take this opportunity to learn it. The best way, of course, is by using it, but before you start, you need to know the very basics. Don’t worry, R is very user friendly and easy to get started in. In addition, it has been around for a long time (its predecessor S appeared in 1976) and is extremely well documented - google introduction to R or a similar phrase, and you will get lots of useful hits.
My plan is to give you a bare minimum in the next few paragraphs, and then to explain additional R concepts as we need them. This way, you will not be overwhelmed right from the start, and you will get a bit of a mathematical context as you learn more. Conversely, learning R commands will help with the math, too.
There are at least three different ways of inputting commands into R - through the console, scripts, and R notebooks.
The console, as I already mentioned, is a window in RStudio where you can enter your R commands one by one. As a command is entered (and Enter is pressed), R runs it and displays the result below. A typical console session looks like this
If you define a variable in a command, it will be available in all the subsequent commands. This way of interacting with R is perfect for quick-and-dirty computations and what is somewhat euphemistically called “prototyping”. In other words, this way you are using R as a calculator. There is another reason why you might be using the console: it is perfect for package installation and for help-related commands. If you type help('log'), the output will appear in the Help pane on the right. You can also see all the available variables in the Environment pane on the (top) right.
As your needs increase, you will need more complex (and longer) code to meet them. This is where scripts come in. They are text files (but with the extension .R) that hold R code. Scripts can be run as a whole, and saved for later. To create a new script, go to File->New File->R Script. That will split your RStudio window in two:
The top part will become a script editor, and your console will shrink to occupy the bottom part. You can write your code in there, edit and update it, and then run the whole script by clicking on Source, or pressing the associated shortcut key.
Inspired by Python Jupyter notebooks, R notebooks
are a creature somewhere between scripts and the console, but also have
some features of their own. An R notebook is nothing other than a
specially formatted text file which contains chunks of R code
mixed with regular text. You can think of these chunks as mini scripts.
What differentiates them from scripts is that chunks can be executed
(evaluated) and the output becomes a part of the notebook:
R notebooks are R’s implementation of literate programming. The
idea is that documentation should be written at the same time as the
program itself. As far as this course is concerned, R notebooks are just
the right medium for homework and exam submission. You can run code and
provide the interpretation of its output in a single document. See here for more
information.
Each chapter in these lecture notes is an R notebook!
The most important thing about learning R (and many
other things, for that matter) is knowing whom (and how) to ask for
help. Luckily, R is a well established language, and you can get a lot
of information by simply googling your problem. For example, if you
google logarithm in R the top hit (at the time of writing)
gives a nice overview and some examples.
Another way to get information about a command or a concept in R is to use the command help. For example, if you input help("log") or ?log in your console, the right-hand side of your screen will display information on the function log and some of its cousins. Almost every help entry has examples at the bottom, and that is where I always go first.
Objects we will be manipulating in this class are almost exclusively vectors and matrices. The simplest vectors are those that have a single component, in other words, numbers. In R, you can assign a number to a variable using two different notations. Both
a <- 1
and
a = 1
will assign the value \(1\) to the
variable a. If you want to create a longer vector, you can
use the concatenation operator c as
follows:
x = c(1, 2, 3, 4)
Once you evaluate the above in your console, the value of
x is stored and you can access it by using the command
print
print(x)
## [1] 1 2 3 4
or simply evaluating x itself:
x
## [1] 1 2 3 4
Unlike all code blocks above them, the last two contain both input and output. It is standard not to mark the input by any symbol (like the usual >), and to mark the output by ## which otherwise marks comments. This way, you can copy any code block from these notes and paste it into the console (or your script) without having to modify it in any way. Try it!
We built the vector x above by concatenating four
numbers (vectors of length 1). You can concatenate vectors of different
sizes, too:
a = c(1, 2, 3)
b = c(4, 5, 6)
(x = c(a, b, 7))
## [1] 1 2 3 4 5 6 7
You may be wondering why I put x = c(a,b,7) in
parentheses. Without them, x would still become
(1,2,3,4,5,6,7), but its value would not be printed out. A statement in
parentheses is not only evaluated, but its result is also printed out.
This way, (x = 2+3) is equivalent to x = 2+3
followed by x or print(x).
Vectors can contain things other than numbers. Strings, for example:
(x = c("Picard", "Data", "Geordi"))
## [1] "Picard" "Data" "Geordi"
If you need a vector consisting of consecutive numbers, use the colon
: notation:
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
For sequences of equally spaced numbers, use the command
seq (check its help for details)
seq(from = 5, to = 20, by = 3)
## [1] 5 8 11 14 17 20
An important feature of R is that many of its functions are vectorized. That means that if you give such a function a vector as an argument, the returned value will be a vector of the results of that operation performed element by element. For example
x = c(10, 20, 30)
y = c(2, 4, 5)
x + y
## [1] 12 24 35
x * y
## [1] 20 80 150
x^2
## [1] 100 400 900
cos(x)
## [1] -0.8390715 0.4080821 0.1542514
The vectors do not need to be of the same size. R uses the recycling rule - it recycles the values of the shorter one, starting from the beginning, until its size matches the longer one:
x = c(10, 20, 30, 40, 50, 60)
y = c(1, 3)
x + y
## [1] 11 23 31 43 51 63
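When the longer length is not a multiple of the shorter one, R still recycles, but it also issues a warning. A small sketch:

```r
x = c(10, 20, 30)
y = c(1, 2)
# y is recycled as 1, 2, 1; R warns that the longer object length
# is not a multiple of the shorter object length
x + y
## [1] 11 22 31
```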
The case where the shorter vector is of length 1 is particularly useful:
x = c(10, 20, 30, 40)
x + 1
## [1] 11 21 31 41
x * (-2)
## [1] -20 -40 -60 -80
Extracting parts of the vector is accomplished by using the
indexing operator []. Here are some
examples (what do negative numbers do?)
x = c(10, 20, 30, 40, 50)
x[1]
## [1] 10
x[c(1, 2)]
## [1] 10 20
x[-1]
## [1] 20 30 40 50
x[-c(3, 4)]
## [1] 10 20 50
x[1:4]
## [1] 10 20 30 40
x[c(1, 1, 2, 2, 5, 4)]
## [1] 10 10 20 20 50 40
People familiar with Python should be aware of the following two differences: 1. indexing starts at 1 and not 0, and 2. negative indexing removes components; it does not start counting from the end!
It is important to note that the thing you put inside [] needs to be a vector itself. The above examples all dealt with numerical indices, but you can use logical indices, too. A variable is said to be logical or Boolean if it can take only one of the two values TRUE or FALSE. A vector whose components are all logical is called, of course, a logical vector. You can think of logical indexing as the operation where you go through your original vector and choose which components you want to keep (TRUE) and which you want to throw away (FALSE). For example
x = c(10, 20, 30, 40, 50)
y = c(TRUE, FALSE, FALSE, TRUE, TRUE)
x[y]
## [1] 10 40 50
This is especially useful when used together with the comparison operators. Expressions like x < y or x == y are operators in R, just like x + y or x / y. The difference is that < and == return logical values. For example
1 == 2
## [1] FALSE
3 > 4
## [1] FALSE
3 >= 2
## [1] TRUE
These operators are vectorized, so you can do things like this
x = c(1, 2, 3, 4, 5)
y = c(1, 3, 3, 2, 5)
x == y
## [1] TRUE FALSE TRUE FALSE TRUE
or, using recycling,
x = c(1, 2, 3, 4, 5)
x > 3
## [1] FALSE FALSE FALSE TRUE TRUE
Let’s combine that with indexing. Suppose that we want to keep only
the values greater than 4 in the vector x. The vector
y = ( x > 4 ) is going to be of the same length as
x and contain logical values. When we index x
using it, only the values of x on positions where
x > 4 will survive, and these are exactly the values we
needed:
x = c(3, 2, 5, 3, 1, 5, 6, 4)
y = (x > 4)
x[y]
## [1] 5 5 6
or, simply,
x[x > 4]
## [1] 5 5 6
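Since TRUE counts as \(1\) and FALSE as \(0\) in arithmetic, logical vectors are also handy for counting. A small sketch, reusing the vector above:

```r
x = c(3, 2, 5, 3, 1, 5, 6, 4)
sum(x > 4)   # how many components exceed 4
## [1] 3
mean(x > 4)  # the proportion of components that do
## [1] 0.375
```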
Indexing can be used to set the values of a vector just as easily
x = c(10, 20, 30, 40, 50)
x[2:4] = c(0, 1, 2)
x
## [1] 10 0 1 2 50
Recycling rules apply in the same way as above
x = c(10, 20, 30, 40, 50)
x[c(1, 2, 5)] = 7
x
## [1] 7 7 30 40 7
A matrix in R can be created using the command matrix.
The unusual part is that the input is a vector and R populates the
components of the matrix by filling it in column by column or row by
row. As always, an example will make this clear
x = c(1, 2, 3, 4, 5, 6)
(A = matrix(x, nrow = 2, ncol = 3, byrow = TRUE))
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
The first argument of the function matrix is the vector which contains all the values. If you want a matrix with \(m\) rows and \(n\) columns, this vector should be of size \(m \times n\). The arguments ncol and nrow are self-explanatory, and byrow is a logical argument which signals whether to fill by rows or by columns. Here is what happens when we set byrow = FALSE
x = c(1, 2, 3, 4, 5, 6)
(A = matrix(x, nrow = 2, ncol = 3, byrow = FALSE))
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
Accessing components of a matrix is as intuitive as it gets
(A = matrix(c(1, -1, 7, 2), nrow = 2, ncol = 2))
## [,1] [,2]
## [1,] 1 7
## [2,] -1 2
A[1, 2]
## [1] 7
Note that I did not use the argument byrow at all. In
such cases, R always uses the default value (documented in the
function’s help). For matrix the default value of
byrow is FALSE, i.e., it fills the matrix
column by column. This is not what we usually want because we tend to
think of matrices as composed of rows. Moral: do not forget
byrow = TRUE if that is what you, indeed, want.
Usual matrix operations can be performed in R in the obvious way
(A = matrix(c(1, -1, 7, 2), nrow = 2, ncol = 2))
## [,1] [,2]
## [1,] 1 7
## [2,] -1 2
(B = matrix(c(2, 2, -3, -4), nrow = 2, ncol = 2))
## [,1] [,2]
## [1,] 2 -3
## [2,] 2 -4
A + B
## [,1] [,2]
## [1,] 3 4
## [2,] 1 -2
You should be careful with matrix multiplication. The naive operator
* yields a matrix, but probably not the one you want (what
does * do?)
(A = matrix(c(1, 2, 0, 1), nrow = 2, ncol = 2))
## [,1] [,2]
## [1,] 1 0
## [2,] 2 1
(B = matrix(c(3, 5, 1, 0), nrow = 2, ncol = 2))
## [,1] [,2]
## [1,] 3 1
## [2,] 5 0
A * B
## [,1] [,2]
## [1,] 3 0
## [2,] 10 0
If you want the matrix product, you have to use %*%
A %*% B
## [,1] [,2]
## [1,] 3 1
## [2,] 11 2
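The same caution applies to matrix powers. A quick sketch contrasting ^ (which works elementwise) with %*%, reusing the matrix A from earlier:

```r
A = matrix(c(1, -1, 7, 2), nrow = 2, ncol = 2)
A^2      # squares each entry separately
##      [,1] [,2]
## [1,]    1   49
## [2,]    1    4
A %*% A  # the true matrix square
##      [,1] [,2]
## [1,]   -6   21
## [2,]   -3   -3
```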
The following syntax is used to define functions in R:
my_function = function(x, y, z) {
return(x + y + z)
}
The function my_function returns the sum of its
arguments. Having defined it, as above, we can use it like this
my_function(1, 3, 9)
## [1] 13
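Function arguments can also be given default values, used whenever the caller omits them. A small sketch (the name my_sum and the default of 0 are my choices for illustration):

```r
# z defaults to 0 when not supplied
my_sum = function(x, y, z = 0) {
  return(x + y + z)
}
my_sum(1, 3)     # uses the default z = 0
## [1] 4
my_sum(1, 3, 9)
## [1] 13
```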
Neither the output nor the arguments of a function in R are
restricted to numbers. Our next example function, named
winners, takes two vectors as arguments and returns a
vector. Its components are those components of the first input vector
(x) that are larger than the corresponding components of
the second input vector (y)
winners = function(x, y) {
z = x > y
return(x[z])
}
winners(c(1, 4, 5, 6, 2), c(2, 3, 3, 9, 2))
## [1] 4 5
Note how we used several things we learned above in this function.
First, we defined the logical vector which indicates where
x is larger than y. Then, we used logical
indexing to return only certain components of x.
Our final element of R is its if-else statement. The
syntax of the if statement is
if (condition) {
statement
}
where condition is anything that has a logical value,
and statement is any R statement. First R evaluates
condition. If it is true, it runs statement.
If it is false, nothing happens. If you want something to happen if (and
only if) your condition is false, you need an if-else
statement:
if (condition) {
statement1
} else {
statement2
}
This way, statement1 is evaluated when condition is true and statement2 when it is false. Since conditions inside the if statement return logical values, we can combine them using ands, ors and nots. The R notation for these operations is &, | and !, respectively, and to remind you what they do, here is a simple table
| x | y | x & y (and) | x \| y (or) | !x (not) |
|---|---|---|---|---|
| TRUE | TRUE | TRUE | TRUE | FALSE |
| TRUE | FALSE | FALSE | TRUE | FALSE |
| FALSE | TRUE | FALSE | TRUE | TRUE |
| FALSE | FALSE | FALSE | FALSE | TRUE |
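These operators are vectorized too, so they combine naturally with logical indexing. A small sketch:

```r
x = c(1, 3, 4, 6, 2, 5)
x[x > 2 & x < 5]    # components strictly between 2 and 5
## [1] 3 4
x[x <= 2 | x >= 5]  # everything else
## [1] 1 6 2 5
```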
Let’s put what we learned about functions and if-else statements together to write a function distance_or_zero whose arguments are the coordinates x and y of a point in the plane, and whose output is the distance from the point (x,y) to the origin if this distance happens to be between 1 and 2, and \(0\) otherwise. We will use similar functions later when we discuss Monte Carlo methods:
distance_or_zero = function(x, y) {
distance = sqrt(x^2 + y^2)
if (distance <= 2 & distance >= 1) {
return(distance)
} else {
return(0)
}
}
distance_or_zero(1.2, 1.6)
## [1] 2
distance_or_zero(2, 3)
## [1] 0
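Because the scalar if handles only a single condition, the function above works one point at a time. A vectorized variant can be sketched with ifelse, which picks elementwise between two alternatives (the name distance_or_zero_vec is mine, not from the notes):

```r
# ifelse(condition, a, b) returns a where condition is TRUE, b where FALSE,
# componentwise over whole vectors
distance_or_zero_vec = function(x, y) {
  distance = sqrt(x^2 + y^2)
  ifelse(distance >= 1 & distance <= 2, distance, 0)
}
distance_or_zero_vec(c(1, 2), c(1, 3))
## [1] 1.414214 0.000000
```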
Here are several simple problems. Their goal is to give you an idea of exactly how much R is required to get started in this course.
Compute the following (your answer should be a decimal number):
Note: some of the answers will look like this 3.14e+13.
If you do not know what that means, google scientific notation
or E notation.
Define two variables \(a\) and \(b\) with values \(3\) and \(4\) and “put” their product into a variable called \(c\). Output the value of \(c\).
Define two vectors \(x\) and \(y\) of length \(3\), such that the components of \(x\) are \(1,2,3\) and the components of \(y\) are \(8,9,0\). Output their (componentwise) sum.
Define a \(2\times 2\) matrix \(A=\begin{pmatrix} 1 & 2 \\ -1 & 3 \end{pmatrix}\).
Compute the matrix square \(A^2\).
Construct a vector \(x\) which contains all numbers from \(1\) to \(100\).
Construct a vector \(y\) which contains squares of all numbers between \(20\) and \(2000\).
Construct a vector \(z\) which contains only those components of \(y\) whose values are between \(400,000\) and \(500,000\).
Compute the average (arithmetic mean) of all the components of \(z\). There is an R function that does that for you - find it!
Write a function that takes a numerical argument \(x\) and returns \(5\) if \(x\geq 5\) and \(x\) itself otherwise.
Write a function that returns TRUE (a logical value)
if its argument is between \(2\) and
\(3\) and FALSE
otherwise.
(Extra credit) Write a function that takes two equal-sized vectors as arguments and returns the angle between them in degrees. For definiteness, the angle between two vectors is defined to be \(0\) when either one of them is \((0,0,\dots,0)\).
In the spirit of “learn by doing”, these lecture notes contain many “Problems”, both within the sections, and at the very end of each chapter. Those within sections come with solutions and usually introduce new concepts. They often feature a Comments section right after the solution subdivided into R and Math comments focusing on the computational or conceptual features, respectively. Note that you are not expected to be able to do the problems within sections before reading their solutions and comments, so don’t worry if you cannot. It is a good practice to try, though. Problems at the end, in the Additional Problems section are (initially) left unsolved. They do not require any new ideas and are there to help you practice the skills presented before.
… where we also review some probability along the way.
“Draw” 50 simulations from the geometric distribution with parameter \(p=0.4\).
rgeom(50, prob = 0.4)
## [1] 1 0 3 4 1 2 0 0 2 2 0 1 5 0 1 0 2 1 1 0 2 2 2 1 0 0 1 3 2 2 1 1 1 3 5 0 1 1
## [39] 0 0 0 1 2 0 1 1 1 0 1 0
Comments (R): R makes it very easy to simulate draws from a large class of named distributions, such as the geometric, binomial, uniform, normal, etc. For a list of all available distributions, run help("distributions"). Each available distribution has an R name; the uniform is unif, the normal is norm, and the binomial is binom, etc. If you want to simulate \(n\) draws (aka a sample of size \(n\)) from a distribution, you form the full command by prefixing its R name with the letter r and use \(n\) as an argument. That is how we arrived at rgeom(50) in the solution above. The additional arguments of the function rgeom have to do with the parameters of that distribution. Which parameters go with which distribution, and how to input them as arguments to rgeom or rnorm, is best looked up in R’s extensive documentation. Try help("rnorm"), for example.
Comments (Math): You could spend your whole life trying to understand what it really means to “simulate” or “generate” a random number. The numbers you obtain from so-called random number generators (RNG) are never random. In fact, they are completely deterministically generated. Still, sequences of numbers obtained from (good) random number generators share so many properties with sequences of mythical truly random numbers, that we can use them as if they were truly random. For the purposes of this class, you can assume that the numbers R gives you as random are random enough. Random number generation is a fascinating topic at the intersection of number theory, probability, statistics, computer science and even philosophy, but we do not have the time to cover any of it in this class. If you want to read a story about a particularly bad random number generator, go here.
You might have encountered a geometric distribution before. A random variable with that distribution can take any positive integer value or \(0\), i.e., its support is \({\mathbb N}_0=\{0,1,2,3,\dots\}\). As you can see from the output above, the value \(0\) appears more often than the value \(3\), and the value \(23\) does not appear at all in this particular simulation run. The probability of seeing the value \(k\in \{0,1,2,3,\dots\}\) as a result of a single draw is given by \((1-p)^k p\), where \(p\) is called the parameter of the distribution.
That corresponds to the following interpretation of the geometric distribution: keep tossing a biased coin (with probability \(p\) of obtaining H) until you see the first H; the number of Ts before that is the value of your geometric random variable. If we put these probabilities in a single table (and choose \(p=0.4\), for example) it is going to look like this:

| Value | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | … |
|---|---|---|---|---|---|---|---|---|---|
| Prob. | 0.4 | 0.24 | 0.144 | 0.086 | 0.052 | 0.031 | 0.019 | 0.011 | … |
Of course, the possible values our random variable can take do not
stop at \(7\). In fact, there are
infinitely many possible values, but we do not have infinite space. Note
that even though the value \(23\) does
not appear in the output of the command rgeom above, it
probably would if we simulated many more than \(50\) values. Let’s try it with \(500\) draws - the table below counts how
many \(0s\), \(1s\), \(2s\), etc. we got:
| Value | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Count | 208 | 132 | 62 | 43 | 23 | 16 | 8 | 3 | 2 | 1 | 2 |
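A table of counts like the one above can be produced with the built-in table command. A sketch (the seed is an arbitrary choice of mine, purely for reproducibility; without it, the counts differ from run to run):

```r
set.seed(362)                 # arbitrary seed, only for reproducibility
draws = rgeom(500, prob = 0.4)
table(draws)                  # counts how many 0s, 1s, 2s, ... we got
```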
Still no luck, but we do observe values above 5 more often. By trial and error, we arrive at about \(1,000,000\) as the required number of simulations:
| Value | 0 | 1 | 2 | 3 | … | 23 | 24 | 25 | 26 |
|---|---|---|---|---|---|---|---|---|---|
| Count | 400616 | 238946 | 144274 | 86489 | … | 3 | 3 | 3 | 3 |
Compute the probability that among \(1,000,000\) draws of a geometric random variable with parameter \(p=0.4\), we never see a number greater than \(22\).
First, we compute the probability that the value seen in a single draw does not exceed \(22\):
pgeom(22, prob = 0.4)
## [1] 0.9999921
Different draws are independent of each other, so we need to raise this to the power \(1,000,000\).
(pgeom(22, prob = 0.4))^(1000000)
## [1] 0.0003717335
Comments (R): The command we used here is pgeom, which is a cousin of rgeom. In general, R commands that involve named probability distributions consist of two parts. The prefix, i.e., the initial letter (p in this case), stands for the operation you want to perform, and the rest is the R name of the distribution. There are 4 prefixes, and the commands they produce are

| Prefix | Description |
|---|---|
| r | Simulate random draws from the distribution. |
| p | Compute the cumulative probability distribution function (cdf) (NOT the pdf). |
| d | Compute the probability density (pdf) or the probability mass function (pmf). |
| q | Compute the quantile function. |
(see the Math section below for a reminder of what these things are). In this problem, we are dealing with a geometric random variable \(X\), which has a discrete distribution with support \(0,1,2,3,\dots\). Therefore, the R name is geom. We are interested in the probability \({\mathbb{P}}[ X\leq 22]\), which corresponds to the cdf of \(X\) at \(x=22\), so we use the prefix p. Finally, we used the named parameter p and gave it the value p = 0.4, because the geometric distribution has a single parameter \(p\).
This problem also gives us a chance to discuss precision. As you can see, the probability of a single draw not exceeding \(22\) is very close to \(1\). In fact, it is equal to it to 5 decimal places. By default, R displays 7 significant digits of a number. That is enough for most applications (and barely enough for this one), but sometimes we need more. For example, let’s try to compute the probability of seeing no T (tails) in 10 tosses of a biased coin, where the probability of H (heads) is 0.9.
1 - 0.1^10
## [1] 1
While very close to it, this probability is clearly not equal to
\(1\), as suggested by the output
above. The culprit is the default precision. We can increase the
precision (up to \(22\) digits) using
the options command
options(digits = 17)
1 - 0.1^10
## [1] 0.99999999989999999
Precision issues like this one should not appear in this course, but they will out there “in the wild”, so it might be a good idea to be aware of them.
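If you prefer not to change the global digits option, you can also format a single number on demand. A sketch using sprintf:

```r
# print one value with 17 digits after the decimal point,
# leaving the global options untouched
sprintf("%.17f", 1 - 0.1^10)
```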
Comments (Math): If you forgot all about pdfs, cdfs and such things, here is a little reminder:

| cdf | \(F(x) = {\mathbb{P}}[X\leq x]\) |
| pdf | \(f(x)\) such that \({\mathbb{P}}[X \in [a,b]] = \int_a^b f(x) \, dx\) for all \(a<b\) |
| pmf | \(p(x)\) such that \({\mathbb{P}}[X=a_n] = p(a_n)\) for some sequence \(a_n\) |
| qf | \(q(p)\) is a number such that \({\mathbb{P}}[ X \leq q(p)] = p\) |
Those random variables that admit a pdf are called continuous; the prime examples are the normal and the exponential distributions. The ones for which a pmf exists are called discrete. The sequence \(a_n\) covers all values that such a discrete random variable can take. Most often, \(a_n\) either covers the set of all natural numbers \(0,1,2,\dots\) or a finite subset such as \(1,2,3,4,5,6\).
Coming back to our original problem, we note that the probability we obtained is quite small. Since \(1/0.000372\) is about \(2690\), we would have to run about \(2690\) rounds of \(1,000,000\) simulations before the largest number falls below \(23\).
Compute the \(0.05\), \(0.1\), \(0.4\), \(0.6\) and \(0.95\) quantiles of the normal distribution with mean \(1\) and standard deviation \(2\).
qnorm(c(0.05, 0.1, 0.4, 0.6, 0.95), mean = 1, sd = 2)
## [1] -2.2897073 -1.5631031 0.4933058 1.5066942 4.2897073
Comments (R): The function we used is qnorm, with the prefix q, which computes the quantile function, and the R name norm, because we are looking for the quantiles of the normal distribution. The additional (named) parameters are where the parameters of the distribution come in (the mean and the standard deviation, in this case). Note how we plugged the entire vector c(0.05, 0.1, 0.4, 0.6, 0.95) into qnorm instead of a single value. You can do that because this function is vectorized. That means that if you give it a vector as an argument, it will “apply itself” to each component of the vector separately, and return the vector of results. Many (but not all) functions in R are vectorized.
As a sanity check, let’s apply pnorm (which computes the cdf of the normal) to these quantile values:
p = qnorm(c(0.05, 0.1, 0.4, 0.6, 0.95), mean = 1, sd = 2)
pnorm(p, mean = 1, sd = 2)
## [1] 0.05 0.10 0.40 0.60 0.95
As expected, we got the original values back - the normal quantile function and its cdf are inverses of each other.
Comments (Math): Computing the cdf of a standard normal is the same thing as reading a normal table. Computing a quantile is the opposite: you go into the middle of the table, find your value, and then figure out which “Z” would give you that value.
Simulate \(60\) throws of a fair \(10\)-sided die.
sample(1:10, 60, replace = TRUE)
## [1] 2 8 9 8 4 7 7 7 2 3 3 10 6 1 9 7 4 7 6 2 2 3 10 1 9
## [26] 7 3 2 8 4 1 2 8 1 4 9 1 9 10 10 6 1 8 6 1 10 5 1 6 9
## [51] 8 3 8 9 4 6 1 6 7 8
Comments (Math): Let \(X\) denote the outcome of a single throw of a fair \(10\)-sided die. The distribution of \(X\) is discrete (it can only take the values \(1,2,\dots, 10\)) but it is not one of the more famous named distributions. I guess you could call it a discrete uniform on \(\{1,2,\dots, 10\}\), but a better way to describe such a distribution is by a distribution table, which is really just a list of the possible values a random variable can take, together with their respective probabilities. In this case,
| Value | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Prob. | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
Comments (R): The command used to draw a sample from a (finite) collection is, of course, sample. The first argument is a vector, and it plays the role of the “bag” from which you are drawing. If we want values to be able to repeat, we also need to specify replace = TRUE; with the default replace = FALSE, each number can be drawn at most once:
sample(1:10, 8, replace = FALSE)
## [1] 1 5 6 7 8 10 3 4
With more than 10 draws, we would run out of numbers to draw:
sample(1:10, 12, replace = FALSE)
## Error in sample.int(length(x), size, replace, prob): cannot take a sample larger than the population when 'replace = FALSE'
The bag you draw from can contain objects other than numbers:
sample(c("Picard", "Data", "Geordi"), 9, replace = TRUE)
## [1] "Picard" "Data" "Geordi" "Geordi" "Data" "Data" "Picard" "Data"
## [9] "Geordi"
So far, each object in the bag had the same probability of being drawn.
You can use the sample command to produce a
weighted sample, too. For example, if we wanted to simulate
\(10\) draws from the following
distribution
| 1 | 2 | 3 |
|---|---|---|
| 0.2 | 0.7 | 0.1 |
we would use the additional argument prob:
sample(c(1, 2, 3), 10, replace = TRUE, prob = c(0.2, 0.7, 0.1))
## [1] 1 2 2 1 1 2 2 3 2 2
Note how it is mostly \(2\)s, as expected.
Draw a sample of size \(n=10\) from \(N(1,2)\), i.e., from the normal distribution with parameters \(\mu=1\), \(\sigma = 2\). Plot a histogram of the obtained values. Repeat for \(n=100\) and \(n=100000\).
x = rnorm(10, mean = 1, sd = 2)
hist(x)
x = rnorm(100, mean = 1, sd = 2)
hist(x)
x = rnorm(100000, mean = 1, sd = 2)
hist(x)
Comments (R): It cannot be simpler! You use the command
hist, feed it a vector of values, and it produces a
histogram. It will even label the axes for you. If you want to learn how
to tweak various features of your histogram, type
?hist.
Esthetically, the built-in histograms leave something to be desired.
We can do better, using the package ggplot2. You don’t have
to use it in this class, but if you want to, you install it first by
running install.packages("ggplot2") (you have to do this
only once). Then, every time you want to use it, you run
library(ggplot2) to notify R that you are about to use a
function from that package. It would take a whole semester to learn
everything there is to know about ggplot2; I will only show
what a histogram looks like in it:
library(ggplot2)
z = rnorm(100000, mean = 1, sd = 2)
ggplot(data=as.data.frame(z), aes(x=z))+
geom_histogram(bins=50, fill="white", color="DarkRed")
Comments (Math): Mathematically, a histogram can be produced for any (finite) sequence of numbers: we divide the range into several bins, count how many of the points in the sequence fall into each bin, and then draw a bar above that bin whose height is equal (or proportional) to that count. The picture tells us how the sequence we started from is “distributed”. The order of the points does not matter - you would get exactly the same picture if you sorted the points first. If the sequence of points you draw the histogram of comes from, say, a normal distribution, the histogram will resemble the shape of the pdf of a normal distribution. I say resemble, because its shape is ultimately random. If the number of points is small (like in the first part of this problem) the histogram may look nothing like the normal pdf. However, as the number of points gets larger and larger, the shape of the histogram gets closer and closer to the underlying pdf (if it exists). I keep writing “shape” because the three histograms above have very different scales on the \(y\)-axis. That is because we used counts to set the heights of the bars. A more natural choice is to use proportions, i.e., relative frequencies (counts divided by the total number of points), for the bar heights. More precisely, the bar height \(h\) over the bin \([a,b]\) is chosen so that the area of the bar, i.e., \((b-a)\times h\), equals the proportion of all points that fall inside \([a,b]\). This way, the total area under the histogram is always \(1\). To draw such a density histogram in R, we pass the additional option freq = FALSE to hist:
x = rnorm(100000, mean = 1, sd = 2)
hist(x, freq = FALSE)
Note how the \(y\)-axes label changed from “Frequency” to “Density”. With such a normalization, the histogram of \(x\) can be directly compared to the probability density of a normal distribution. Here is a histogram of \(100,000\) simulations from our normal distribution with its density function (pdf) superimposed:
sims = rnorm(100000, mean = 1, sd = 2)
x = seq(-6, 8, by = 0.02)
y = dnorm(x, mean = 1, sd = 2)
hist(sims, freq = FALSE, main = "")
points(x, y, type = "l", lwd = 3, col = "red")
Let x contain \(2,000\)
draws from \(N(0,1)\), z
another \(2,000\) draws from \(N(0,1)\) and let y=x^2+z.
Draw a scatterplot of x and y to
visualize the joint distribution of x and
y
Plot two histograms, one of x and one of
y. Do they tell the whole story about the joint
distribution of x and y?
Are x and y correlated? Do
x and y in your plot “look independent”? Use
the permutation test to check for independence between x and
y.
x = rnorm(2000)
z = rnorm(2000)
y = x^2 + z
plot(x, y)
hist(x)
hist(y)
No, the two histograms would not be enough to describe the joint distribution. There are many ways in which two random variables \(X\) and \(Y\) can be jointly distributed, but whose separate (marginal) distributions match the histograms above. To give a very simple example, let \(X\) and \(Y\) be discrete random variables, each of which can only take values \(0\) or \(1\). Consider the following two possible joint distribution tables for the random pair \((X,Y)\):
| | 0 | 1 |
|---|---|---|
| 0 | 1/4 | 1/4 |
| 1 | 1/4 | 1/4 |

| | 0 | 1 |
|---|---|---|
| 0 | 1/2 | 0 |
| 1 | 0 | 1/2 |
In both cases, the marginals are the same, i.e., both \(X\) and \(Y\) are equally likely to take the value \(0\) or \(1\), i.e., they both have the Bernoulli distribution with parameter \(p=1/2\). That would correspond to the separate histograms to be the same. On the other hand, their joint distributions (aka dependence structures) are completely different. In the first (left) case, \(X\) and \(Y\) are independent, but in the second they are completely dependent.
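A quick simulation (entirely my own sketch) makes the same point: in the first scheme the two coordinates are drawn independently, in the second \(Y\) is simply a copy of \(X\), yet the one-variable summaries are indistinguishable:

```r
set.seed(1)                            # for reproducibility (my choice)
n = 10000
x1 = sample(0:1, n, replace = TRUE)    # independent case
y1 = sample(0:1, n, replace = TRUE)
x2 = sample(0:1, n, replace = TRUE)    # completely dependent case:
y2 = x2                                # y2 is a copy of x2
table(y1) / n                          # marginal of y1: about (0.5, 0.5)
table(y2) / n                          # marginal of y2: the same
table(x1, y1) / n                      # joint: about 1/4 in every cell
table(x2, y2) / n                      # joint: all mass on the diagonal
```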
They are probably not correlated since the sample correlation between
x and y is close to \(0\):
(cor(x, y))
## [1] -0.02880239
but they do not look independent.
To apply the permutation test, we first plot the scatterplot of
x vs. y as above. Then, we replace
y by a vector with the same components, but randomly
permute their positions, and then plot a scatterplot again. We repeat
this three times:
y_perm_1 = sample(y)
y_perm_2 = sample(y)
y_perm_3 = sample(y)
plot(x, y)
plot(x, y_perm_1)
plot(x, y_perm_2)
plot(x, y_perm_3)
The conclusion is clear: the first (upper-left) plot is very different from the other three. Therefore, x and y are probably not independent.
Comments (Math): The point of this problem is to review the notion of the joint distribution of two random variables. The most important point here is that there is more to the joint distribution of two random variables than just the two distributions taken separately. In a sense, the whole is (much) more than the sum of its parts. This is something that does not happen in the deterministic world. If you give me the \(x\)-coordinate of a point, and, separately, its \(y\)-coordinate, I will be able to pinpoint the exact location of that point.
On the other hand, suppose that the \(x\)-coordinate of a point is unknown, so we treat it as a random variable, and suppose that this variable admits the standard normal distribution. Do the same for \(y\). Even with this information, you cannot say anything about the position of the point \((x,y)\). It could be that the reason we are uncertain about \(x\) and the reason we are uncertain about \(y\) have nothing to do with each other; in that case we would be right to assume that \(x\) and \(y\) are independent. If, on the other hand, we got the values of both \(x\) and \(y\) by measuring them using the same, inaccurate, tape measure, we cannot assume that the errors are independent. It is more likely that both \(x\) and \(y\) are too big, or both \(x\) and \(y\) are too small.
Mathematically, we say that random variables \(X\) and \(Y\) are independent if \[{\mathbb{P}}[X \in [a,b]] \times {\mathbb{P}}[ Y
\in [c,d] ] = {\mathbb{P}}[ X\in [a,b] \text{ and } Y\in [c,d]]\text{
for all } a,b,c,d.\] While precise, this definition is
not very eye-opening, nor directly applicable in most cases. Intuitively,
\(X\) and \(Y\) are independent if the distribution of
\(Y\) would not change if we received
additional information about \(X\). In
our problem, random variables \(X\) and
\(Y\) correspond to vectors
x and y. Their scatterplot above clearly
conveys the following message: when x is around \(-2\), we expect y to be around
4, while when x is around \(0\), y would be expected to be
around \(0\), too.
Sometimes, it is not so easy to decide whether two variables are
independent by staring at a scatterplot. What would you say about the
scatterplot below?
The permutation test is designed to help you decide
when two (simulated) random variables are likely to be independent. The
idea is simple. Suppose that
x and y are
simulations from two independent (not necessarily identical)
distributions; say x=runif(1000) and
y=rnorm(1000). The vector y_perm=sample(y) is
a randomly permuted version of y (see R section below) and
it contains exactly the same information about the distribution of
y as y itself does. Both y and
y_perm will produce exactly the same histogram. Permuting
y, however, “uncouples” it from x. If there
was any dependence between the values of x and
y before, there certainly isn’t any now. In other words, the joint
distribution of x and y_perm has the same
marginals as the joint distribution of x and
y, but all the (possible) dependence has been removed. What
remains is to compare the scatterplot between x and
y and the scatterplot between x and
y_perm. If they look about the same, we conclude that
x and y are independent. Otherwise, there is
some dependence between them.
One question remains: why did we have to draw three scatterplots of
permuted versions of y? That is because we have only
finitely many data points, and it can happen, by pure chance, that the
permutation we applied to y does not completely scramble
its dependence on x. With a “sample” of three such plots,
we get a better feeling for the inherent randomness in this permutation
procedure, and it is much easier to tell whether “one of these things is
not like the others”. Btw, the random variables in the scatterplot above
are, indeed, independent; here are the \(4\) permutation-test plots to “prove” it:
Unlike univariate (one-variable) distributions which are visualized
using histograms or similar plots, multivariate (several-variable)
distributions are harder to depict. The most direct relative of the
histogram is a 3d histogram. Just like the \(x\)-axis is divided into bins in the
univariate case, in the multivariate case we divide the \(xy\)-plane into regions (squares, e.g.) and
count the number of points falling into each of these regions. After
that a 3d bar (a skyscraper) is drawn above each square with the height
of each skyscraper equal (or proportional) to the number of points which
fall into its base. Here is a 3d histogram of our original pair
(x,y) from the problem. You should be able to
rotate and zoom it right here in the notes, provided your browser has
JavaScript enabled:
A visualization solution that requires less technology would start the same way, i.e., by dividing the \(xy\) plane into regions, but instead of the third dimension, it would use different colors to represent the counts. Here is an example where the regions are hexagons, as opposed to squares; it just looks better, for some reason:
Just to showcase the range of possibilities, here is another
visualization technique, which requires deeper statistical tools,
namely the density contour plot:
Comments (R): There is very little new R here. You should
remember that if x and y are vectors of the
same length, plot(x,y) gives you a scatterplot of
x and y.
To compute the sample correlation between two vectors, use the function cor.
We used the command sample(y) to obtain a randomly
permuted version of y. The simplicity of this is due to
default parameters of the command sample which we already
learned about. In particular, the default number of samples is exactly
the size of the input vector y and, by default, sampling is
performed without replacement. If you think about it for a
second, you will realize that a sample of size \(n\) from the vector of size \(n\) without replacement is nothing
but a random permutation of y.
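You can verify this claim directly (a small sanity check of my own): sorting a vector and sorting its permuted version gives identical results, since the permutation only reshuffles the values.

```r
y = c(10, 20, 30, 40, 50)
y_perm = sample(y)                 # a random permutation of y
sort(y_perm)                       # the same values as y, just reshuffled
identical(sort(y), sort(y_perm))   # TRUE
```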
You are not required to do this in your submissions, but if you want
to display several plots side-by-side, use the command
par(mfrow=c(m,n)) before the command plot. It
tells R to plot the next \(mn\) plots
in a \(m\times n\) grid.
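For example, the following (arbitrary) pair of histograms ends up side by side in a \(1\times 2\) grid:

```r
par(mfrow = c(1, 2))   # the next 2 plots go into a 1-by-2 grid
hist(rnorm(1000))
hist(runif(1000))
par(mfrow = c(1, 1))   # back to one plot per figure
```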
Let the random variables \(X\) and \(Y\) have the joint distribution given by the following table:
| | 1 | 2 | 3 |
|---|---|---|---|
| 1 | 0.1 | 0.2 | 0.3 |
| 2 | 0.2 | 0.2 | 0.0 |
Simulate \(10,000\) draws from the distribution of \((X,Y)\) and display a contingency table of your results.
joint_distribution_long = data.frame(
x = c(1, 1, 1, 2, 2, 2),
y = c(1, 2, 3, 1, 2, 3)
)
probabilities_long =
c(0.1, 0.2, 0.3, 0.2, 0.2, 0.0)
sampled_rows = sample(
x = 1:nrow(joint_distribution_long),
size = 10000,
replace = TRUE,
prob = probabilities_long
)
draws = joint_distribution_long[sampled_rows, ]
table(draws)
## y
## x 1 2 3
## 1 962 2027 3047
## 2 1945 2019 0
Comments (Math): The main mathematical idea is to think of each pair of possible values of \(X\) and \(Y\) as a separate “object”, put all these objects in a “bag”, and then draw from the bag. In other words, we convert the bivariate distribution from the problem into the following univariate distribution
| (1,1) | (1,2) | (1,3) | (2,1) | (2,2) | (2,3) |
|---|---|---|---|---|---|
| 0.1 | 0.2 | 0.3 | 0.2 | 0.2 | 0 |
and sample from it. When you do, you will get a vector whose elements are pairs of numbers. The last step is to extract the components of those pairs into separate vectors.
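One way to implement the bag-of-pairs idea quite literally (an alternative sketch, not the approach used in the solution above) is to sample string labels for the pairs and split them apart afterwards:

```r
labels = c("1,1", "1,2", "1,3", "2,1", "2,2", "2,3")
probs  = c(0.1, 0.2, 0.3, 0.2, 0.2, 0.0)
draws  = sample(labels, 10000, replace = TRUE, prob = probs)
# split each "x,y" label back into its two coordinates
parts = strsplit(draws, ",")
x = as.numeric(sapply(parts, `[`, 1))
y = as.numeric(sapply(parts, `[`, 2))
table(x, y)   # the pair (2,3) should never appear
```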
Comments (R): The most important new R concept here is
data.frame. You should think of it as a spreadsheet. It is,
mathematically, a matrix, but we do not perform any mathematical
operations on it. Moreover, not all columns in the data frame have to be
numeric. Some of them can be strings, and others can be something even
more exotic. You should think of a data frame as a bunch of column
vectors of the same length stacked side by side. It is important to note
that each column of a data frame will have a name, so that we don’t have
to access it by its position only (as we would have to in the case of a
matrix).
In this class, the column vectors of data frames are going to contain simulated values. In statistics, it is data that comes in data frames, with rows corresponding to different observations, and columns to various observed variables.
The easiest way to construct a data frame using already existing vectors is as follows:
x = c(1, 2, 3)
y = c("a", "b", "c")
(df = data.frame(x, y))
## x y
## 1 1 a
## 2 2 b
## 3 3 c
Note that the two columns inherited their names from the vectors
x and y that fed into them. Note, also, that
all rows got consecutive numerical values as names by default. Row names
are sometimes useful to have, but are in general a nuisance and should
be ignored (especially in this class). Column names are more important,
and there is a special notation (the dollar-sign notation) that allows
you to access a column by its name:
df$y
## [1] "a" "b" "c"
If you want to give your columns custom names (or if you are building them out of explicitly given vectors as in the solution above) use the following syntax
z = c("a", "b", "c", "d")
(df = data.frame(letters = z, numbers = c(1, 2, 3, 4)))
## letters numbers
## 1 a 1
## 2 b 2
## 3 c 3
## 4 d 4
A feature that data frames share with vectors and matrices is that
you can use vector indexing as in the following example (where
df is as above)
df[c(2, 4, 4, 1), ]
## letters numbers
## 2 b 2
## 4 d 4
## 4.1 d 4
## 1 a 1
Note the row names in the output: 2, 4, 4.1 and 1, rather than 2, 4, 4 and 1. R’s desire to
keep row names unique leads to some cumbersome constructs such as
4.1 above. As I mentioned before, just disregard them.
A nice thing about data frames is that they can easily be pretty-printed in RStudio. Go to the Environment tab in one of your RStudio panes, and double click on the name of the data frame you just built. It will appear as a nicely formatted spreadsheet.
Once we have the data frame containing all \(6\) pairs of possible values \(X\) and \(Y\) can take (called
joint_distribution_long in the solution above), we can
proceed by sampling from its rows, by sampling from the set
1,2,3,4,5,6 with probabilities
0.1, 0.2, 0.3, 0.2, 0.2, 0.0. The result of the
corresponding sample command will be a sequence - called
sampled_rows in the solution - of length \(10,000\) composed of numbers \(1,2,3,4,5\) or \(6\). The reason we chose the name
sampled_rows is because each number corresponds to a row
from the data frame joint_distribution_long, and by
indexing joint_distribution_long by
sampled_rows we are effectively sampling from its rows. In
other words, the command
joint_distribution_long[sampled_rows, ] turns a bunch of
numbers into a bunch of rows (many of them repeated) of the data frame
joint_distribution_long.
The final step is to use the function table. This time,
we are applying it to a data frame and not to a vector, but the effect
is the same. It tabulates all possible combinations of values of the
columns, and counts how many times each of them happened. The same
result would have been obtained by calling
table(draws$x, draws$y).
Use Monte Carlo to estimate the expected value of the exponential random variable with parameter \(\lambda= 4\) using \(n=10\), \(n=1,000\) and \(n=1,000,000\) simulations. Compare to the exact value.
x = rexp(10, rate = 4)
mean(x)
## [1] 0.1779768
For an exponential random variable with parameter \(\lambda\), the expected value is \(1/\lambda\) (such information can be found in Appendix A) which, in this case, is \(0.25\). The error made was 0.072023 for \(n=10\) simulations.
We increase the number of simulations to \(n=1000\) and get a better result
x = rexp(1000, rate = 4)
mean(x)
## [1] 0.2564643
with (smaller) error -0.0064643. Finally, let’s try \(n=1,000,000\):
x = rexp(1000000, rate = 4)
mean(x)
## [1] 0.250381
The error is even smaller -0.00038101.
Comments (R): The only new thing here is the command
mean which computes the mean of a vector.
Comments (Math): There is a lot going on here conceptually. This is the first time we used the Monte Carlo method. It is an incredibly useful tool, as you will keep being reminded throughout this class. The idea behind it is simple, and it is based on the Law of large numbers:
Theorem Let \(X_1,X_2,
\dots\) be an independent sequence of random variables with the
same distribution, for which the expected value can be computed. Then
\[ \tfrac{1}{n} \Big( X_1+X_2+\dots+X_n\Big)
\to {\mathbb{E}}[X_1] \text{ as } n\to\infty\] The idea behind
Monte Carlo is to turn this theorem “upside down”. The goal is to
compute \({\mathbb{E}}[X_1]\) and use a
supply of random numbers, each of which comes from the same
distribution, to accomplish that. The random number generator inside
rexp gives us a supply of numbers (stored in the vector
x) and all we have to do is compute their average. This
gives us the left-hand side of the formula above, and, if \(n\) is large enough, we hope that this
average does not differ too much from its theoretical limit. As \(n\) gets larger, we expect better and
better results. That is why your error above gets smaller as \(n\) increases.
It looks like Monte Carlo can only be used to compute the expected value of a random variable, which does not seem like such a big deal. But it is! You will see in the sequel that almost anything can be written as the expected value of some random variable.
Use Monte Carlo to estimate \({\mathbb{E}}[X^2]\), where \(X\) is a standard normal random variable.
You may or may not know that when \(X\) is standard normal, \(Y=X^2\) has a \(\chi^2\) distribution with one degree of freedom. If you do, you can solve the problem like this:
y = rchisq(5000, df = 1)
mean(y)
## [1] 0.9771929
If you don’t, you can do the following:
x = rnorm(5000)
y = x^2
mean(y)
## [1] 1.019852
Comments (Math+R): We are asked to compute \({\mathbb{E}}[ X^2]\), which can be
interpreted in two ways. First, we can think of \(Y=X^2\) as a random variable in its own
right and you can try to take draws from the distribution of \(Y\). In the case of the normal
distribution, the distribution of \(Y\)
is known - it happens to be a \(\chi^2\)-distribution with a single degree
of freedom (don’t worry if you never heard of it). We can simulate it in
R by using its R name chisq and get a number close to the
exact value of \(1\).
If you did not know about the \(\chi^2\) distribution, you would not know
what R name to put the prefix r in front of. What makes the
simulation possible is the fact that \(Y\) is a transformation of a
random variable we know how to simulate. In that case, we simply
simulate the required number of draws x from the normal
distribution (using rnorm) and then apply the
transformation \(x \mapsto x^2\) to the
result. The transformed vector y is then nothing but the
sequence of draws from the distribution of \(X^2\).
The idea described above is one of the main advantages of the Monte Carlo technique: if you know how to simulate a random variable, you also know how to simulate any (deterministic) function of it. That fact will come into its own a bit later when we start working with several random variables and stochastic processes, but it can be very helpful even in the case of a single random variable, as you will see in the next problem.
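As another illustration (my own, not part of any problem here): for a standard normal \(X\), the exact value of \({\mathbb{E}}[e^X]\) is known to be \(e^{1/2} \approx 1.6487\), and we can estimate it by simply transforming draws from rnorm:

```r
set.seed(1)    # for reproducibility (my choice)
x = rnorm(100000)
mean(exp(x))   # should be close to exp(1/2) = 1.6487...
```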
Let \(X\) be a standard normal random variable. Use Monte Carlo to estimate the probability \({\mathbb{P}}[ X > 1 ]\). Compare to the exact value.
The estimated probability:
x = rnorm(10000)
y = x > 1
(p_est = mean(y))
## [1] 0.1608
The exact probability and the error
p_true = 1 - pnorm(1)
(err = p_est - p_true)
## [1] 0.002144746
Comments (R): As we learned before, the symbol
> is an operation, which returns a Boolean
(TRUE or FALSE) value. For example:
1 > 2
## [1] FALSE
5^2 > 20
## [1] TRUE
It is vectorized:
x = c(1, 2, 4)
y = c(5, -4, 3)
x > y
## [1] FALSE TRUE TRUE
and recycling rules apply to it (so that you can compare a vector and a scalar, for example)
x = 1:10
x > 5
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
Therefore, the vector y in the solution is a vector of
length \(10000\) whose elements are
either TRUE or FALSE; here are the first 5 rows of a data frame with columns x and y from our solution:
| x | y |
|---|---|
| 1.9493 | TRUE |
| -1.1015 | FALSE |
| 1.0448 | TRUE |
| -0.1384 | FALSE |
| -0.2573 | FALSE |
Finally, p_est contains the mean of y. How do you compute a mean of Boolean values? In R (and many other languages) TRUE and FALSE have default numerical values, usually \(1\) and \(0\). This way, when R is asked to compute the sum of a Boolean vector, it will effectively count the number of values which are TRUE. Similarly, the mean is the relative proportion of TRUE values.
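A tiny example of this coercion in action:

```r
b = c(TRUE, FALSE, TRUE, TRUE)
sum(b)    # 3, the number of TRUE values
mean(b)   # 0.75, the proportion of TRUE values
```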
Comments (Math): We computed the proportion of the “times” \(X>1\) (among many simulations of \(X\)) and used it to approximate the probability \({\mathbb{P}}[ X>1]\). More formally, we started from a random variable \(X\) with a normal distribution and then transformed it into another random variable, \(Y\), by setting \(Y=1\) whenever \(X>1\) and \(0\) otherwise. This is often written as follows \[ Y = \begin{cases} 1, & X>1 \\ 0, & X\leq 1.\end{cases}\] The random variable \(Y\) is very special - it can only take values \(0\) and \(1\) (i.e., its support is \(\{0,1\}\)). Such random variables are called indicator random variables, and their distribution, called the Bernoulli distribution, always looks like this:
| 0 | 1 |
|---|---|
| 1-p | p |
for some \(p \in [0,1]\). The parameter \(p\) is nothing but the probability \({\mathbb{P}}[Y=1]\).
So why did we decide to transform \(X\) into \(Y\)? Because of the following simple fact: \[ {\mathbb{E}}[ Y] = 1 \times p + 0 \times (1-p) = p.\] The expected value of an indicator is the probability \(p\), and we know that we can use Monte Carlo whenever we can express the quantity we are computing as an expected value of a random variable we know how to simulate.
Many times, simulating a random variable is easier than analyzing it analytically. Here is a fun example:
Use Monte Carlo to estimate the value of \(\pi\) and compute the error.
nsim = 1000000
x = runif(nsim,-1, 1)
y = runif(nsim,-1, 1)
z = (x ^ 2 + y ^ 2) < 1
(pi_est = 4 * mean(z))
## [1] 3.141728
(err = pi_est - pi)
## [1] 0.0001353464
Comments (Math):
As we learned in the previous problem, probabilities of events can be
computed using Monte Carlo, as long as we know how to simulate the
underlying indicator random variable. In this case, we want to compute
\(\pi\), so we would need to find a
“situation” in which the probability of something is \(\pi\). Of course, \(\pi>1\), so it cannot be a probability
of anything, but \(\pi/4\) can, and
computing \(\pi/4\) is as useful as
computing \(\pi\). To create the
required probabilistic “situation” we think of the geometric meaning of
\(\pi\), and come up with the following
scheme. Let \(X\) and \(Y\) be two independent uniform random
variables each with values between \(-1\) and \(1\). We can think of the pair \((X,Y)\) as a random point in the square
\([-1,1]\times [-1,1]\). This point
will sometimes fall inside the unit circle, and sometimes it will not.
What is the probability of hitting the circle? Well, since \((X,Y)\) is uniformly distributed everywhere
inside the square, this probability should be equal to the portion of
the area of our square which belongs to the unit circle. The area of the
square is \(4\) and the area of the
circle is \(\pi\), so the required
probability is \(\pi/4\). Using the
idea from the previous problem, we define the indicator random variable
\(Z\) as follows \[ Z = \begin{cases} 1 & (X,Y) \text{ is inside
the unit circle, } \\ 0 & \text{ otherwise.}
\end{cases}
= \begin{cases} 1& X^2+Y^2 < 1, \\ 0 & \text{ otherwise.}
\end{cases}\]
1. Write an R function cumavg which computes the
sequence of running averages of a vector, i.e., if the input is \(x=(x_1,x_2,x_3,\dots, x_n)\), the output
should be \[ \Big(x_1, \frac{1}{2} (x_1+x_2),
\frac{1}{3}(x_1+x_2+x_3), \dots, \frac{1}{n}
(x_1+x_2+\dots+x_n)\Big).\] Test it to check that it really
works. (Hint: look up the function cumsum. )
2. Apply cumavg to the vector \(4 z\) from the previous problem and plot your results (use a smaller value for nsim, say \(1000\)). Plot the values against their index. Add a red horizontal line at the level \(\pi\). Rerun the same code (including the simulation part) several times.
Once we follow the hint and use cumsum, the problem becomes much easier.
cumavg = function(x) {
c = cumsum(x)
n = 1:length(x)
return(c/n)
}
x = c(1, 3, 5, 3, 3, 9)
cumavg(x)
## [1] 1 2 3 3 3 4
nsim = 1000
x = runif(nsim,-1, 1)
y = runif(nsim,-1, 1)
z = (x ^ 2 + y ^ 2) < 1
pi_est = cumavg(4 * z)
plot(
1:nsim,
pi_est,
type = "l",
xlab = "number of simulations",
ylab = "estimate of pi",
main = "Computing pi by Monte Carlo"
)
abline(pi, 0,
col = "red")
Comments (R): This course is not about R graphics, but I
think it is a good idea to teach you how to make basic plots in R. We
already used the function plot to draw scatterplots. By
default, each point drawn by plot is marked by a small
circle so it might not seem like a good idea to use it. Luckily this,
and many other things, can be adjusted by numerous additional arguments.
One such argument is type, which determines the type of
the plot. We used type="l" which tells R to join the points
with straight lines:
x = c(1, 3, 4, 7)
y = c(2, 1, 5, 5)
plot(x, y, type = "l")
The other arguments,
xlab, ylab and
main determine labels for axes and the entire plot. The
function abline(a,b) adds the line \(y = a + b x\) to an already existing plot.
It is very useful in statistics if one wants to show the regression line
superimposed on the scatterplot of data. Finally, the argument
col, of course, determines the color of the line. To learn
about various graphical parameters, type ?par.
Comments (Math): The conceptual reason for this exercise is to explore (numerically) the kinds of errors we make when we use Monte Carlo. Unlike the deterministic numerical procedures, Monte Carlo has a strange property that no bound on the error can be made with absolute certainty. Let me give you an example. Suppose that you have a biased coin, with the probability \(0.6\) of heads and \(0.4\) of tails. You don’t know this probability, and use a Monte Carlo technique to estimate it - you toss your coin \(1000\) times and record the number of times you observe \(H\). The law of large numbers suggests that the relative frequency of heads is close to the true probability of \(H\). Indeed, you run a simulation
x = sample(c("T", "H"), 1000, prob = c(0.4, 0.6), replace = TRUE)
y = x == "H"
mean(y)
## [1] 0.594
and get a pretty accurate estimate of \(0.594\). If you run the same code a few more times, you will get different estimates, but all of them will be close to \(0.6\). Theoretically, however, your simulation could have yielded \(1000\) Hs, which would lead you to report \(p=1\) as the Monte-Carlo estimate. The point is that even though such disasters are theoretically possible, they are exceedingly unlikely. The probability of getting all \(H\) in \(1000\) tosses of this coin is a number with more than \(200\) zeros after the decimal point.
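R can even evaluate this probability directly (barely: it sits near the edge of what double precision can represent), and its base-10 logarithm tells us how many zeros it starts with:

```r
0.6 ^ 1000          # about 1.4e-222: astronomically small
log10(0.6 ^ 1000)   # about -221.8, so over 200 zeros after the decimal point
```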
The take-home message is that even though there are no guarantees,
Monte Carlo performs well the vast majority of the time. The crucial
ingredient, however, is the number of simulations. The plot you were
asked to make illustrates exactly that. The function cumavg
gives you a whole sequence of Monte-Carlo estimates of the same thing
(the number \(\pi\)) with different
numbers of simulations nsim. For small values of
nsim the error is typically very large (and very random).
As the number of simulations grows, the situation stabilizes and the
error decreases. Without going into the theory behind it, let me only
mention that in the majority of practical applications we have the
following relationship: \[ error \sim
\frac{1}{\sqrt{n}}.\] In words, if you want to double the
precision, you need to quadruple the number of simulations. If you want
an extra digit in your estimate, you need to multiply the number of
simulations by \(100\). Here is an
image where I superimposed \(40\) plots
like the one you were asked to produce (the dashed lines are \(\pm \frac{4}{\sqrt{n}}\)):
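Here is a numerical sketch of the \(1/\sqrt{n}\) scaling (my own check, with arbitrarily chosen sizes and seed): repeating the \(\pi\) estimate many times at \(n=1000\) and \(n=4000\), the spread of the estimates at the larger size should be about half of what it is at the smaller one.

```r
set.seed(1)                        # for reproducibility (my choice)
estimate_pi = function(n) {
  x = runif(n, -1, 1)
  y = runif(n, -1, 1)
  4 * mean(x ^ 2 + y ^ 2 < 1)
}
sd_1000 = sd(replicate(500, estimate_pi(1000)))
sd_4000 = sd(replicate(500, estimate_pi(4000)))
sd_1000 / sd_4000                  # should be close to 2
```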
Let \(X\) and \(Y\) be two independent geometric random variables with parameters \(p=0.5,\) and let \(Z=X+Y\). Compute \({\mathbb{P}}[ X = 3| Z = 5]\) using simulation. Compare to the exact value.
By simulation:
nsim = 1000000
X = rgeom(nsim, prob = 0.5)
Y = rgeom(nsim, prob = 0.5)
Z = X + Y
X_cond = X[Z == 5]
mean(X_cond == 3)
## [1] 0.1684758
To get the exact value, we start from the definition: \[ {\mathbb{P}}[ X = 3 | Z= 5 ] = \frac{{\mathbb{P}}[ X=3 \text{ and }Z=5]}{{\mathbb{P}}[Z=5]} = \frac{{\mathbb{P}}[X=3 \text{ and }Y = 2]}{{\mathbb{P}}[Z=5]}, \] where the last equality follows from the fact that \(\{ X=3 \text{ and } Z=5 \}\) is exactly the same event as \(\{ X = 3 \text{ and } Y=2\}\). Since \(X\) and \(Y\) are independent, we have \[{\mathbb{P}}[ X=3 \text{ and }Y=2 ] = {\mathbb{P}}[X=3] \times {\mathbb{P}}[ Y=2] = 2^{-4} 2^{-3} = 2^{-7}.\] To compute \({\mathbb{P}}[ Z = 5]\) we need to split the event \(\{ Z = 5 \}\) into events we know how to deal with. Since \(Z\) is built from \(X\) and \(Y\), we write \[ \begin{align} {\mathbb{P}}[ Z = 5 ] = &{\mathbb{P}}[X=0 \text{ and }Y=5]+ {\mathbb{P}}[ X=1 \text{ and }Y=4] + {\mathbb{P}}[ X=2 \text{ and }Y=3] + \\ & {\mathbb{P}}[ X=3 \text{ and }Y=2] + {\mathbb{P}}[ X=4 \text{ and }Y=1] + {\mathbb{P}}[ X = 5 \text{ and }Y=0]. \end{align}\] Each of the individual probabilities in the sum above is \(2^{-7}\), so \({\mathbb{P}}[ X = 3 | Z = 5] = \frac{1}{6}\). This gives us an error of 0.0018091.
Comments (Math): Let us, first, recall what the conditional probability is. The definition we learn in the probability class is the following \[ {\mathbb{P}}[A | B] = \frac{{\mathbb{P}}[A \text{ and }B]}{{\mathbb{P}}[B]},\] as long as \({\mathbb{P}}[B]>0\). The interpretation is that \({\mathbb{P}}[A|B]\) is still the probability of \(A\), but now in the world where \(B\) is guaranteed to happen. Conditioning usually happens when we receive new information. If someone tells us that \(B\) happened, we can disregard everything in the complement of \(B\) and adjust our probability to account for that fact. First we remove from \(A\) anything that belongs to the complement of \(B\), and recompute the probability \({\mathbb{P}}[A \cap B]\). We also have to divide by \({\mathbb{P}}[B]\) because we want the total probability to be equal to \(1\).
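To see the definition in action on a toy example of my own (not part of the problem): for a fair die, with \(A = \{\text{roll is even}\}\) and \(B=\{\text{roll is at least } 4\}\), the formula gives \({\mathbb{P}}[A|B] = \tfrac{2/6}{3/6} = \tfrac{2}{3}\), and rejection-based simulation agrees:

```r
# check the formula by simulation: P[even | at least 4] = (2/6)/(3/6) = 2/3
nsim = 100000
rolls = sample(1:6, size = nsim, replace = TRUE)
rolls_cond = rolls[rolls >= 4]   # keep only the rolls where B happened
mean(rolls_cond %% 2 == 0)       # relative frequency of A among those
```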
Our code starts as usual, by simulating \(X\) and \(Y\) from the required distribution, and
constructing a new vector \(Z\) as
their sum. The variable X_cond is new; we build it from
\(X\) by removing all the elements
whose corresponding \(Z\) is
not equal to \(5\). This is an
example of what is sometimes called the rejection
method in simulation. We simply “reject” all simulations which
do not satisfy the condition we are conditioning on. We can think of
X_cond as a bunch of simulations of \(X\), but in the world where \(Z=5\) is guaranteed to happen. Once we have
X_cond, we proceed as usual by computing the relative
frequency of the value \(3\) among all
possible values \(X\) can take. Note
that the same X_cond can also be used to compute the
conditional probability \({\mathbb{P}}[ X=1|
Z=5]\). In fact, X_cond contains the information
about the entire conditional distribution of \(X\) given \(Z=5\); if we draw a histogram of
X_cond, we will get a good idea of what this distribution
looks like:
Since X_cond contains only discrete values from \(0\) to \(5\), a contingency table might be a better
tool for understanding its distribution:
| 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 7745 | 7761 | 7691 | 7807 | 7731 | 7604 |
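A table like the one above can be produced with R's built-in table function; here it is on a fresh batch of simulations (so the exact counts will differ slightly from those shown):

```r
# contingency table of the conditional sample
nsim = 1000000
X = rgeom(nsim, prob = 0.5)
Y = rgeom(nsim, prob = 0.5)
X_cond = X[X + Y == 5]
table(X_cond)
```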
The histogram and the table above suggest that the distribution of \(X\), given \(Z=5\), is uniform on \(\{0,1,2,3,4,5\}\). It is: a calculation almost identical to the one we performed above gives \({\mathbb{P}}[ X= i| Z=5] = \frac{1}{6}\) for each \(i=0,1,2,3,4,5\).
One more observation at the end. Note that we drew \(n=1,000,000\) simulations this time. While
it is probably an overkill for this particular example, conditional
probabilities in general require more simulations than unconditional
ones. Of course, that is because we reject most of our original draws.
Indeed, the size of the vector X_cond is 46339 - more than
a \(20\)-fold decrease. This fact
becomes particularly apparent when we try to use Monte Carlo for
conditional distributions associated with continuous random
vectors as we will see in our next problem.
Let \(X\) and \(Y\) be independent random variables where \(X\) has the \(N(0,1)\) distribution and \(Y\) the exponential distribution with parameter \(\lambda=1\). Find a graphical approximation to the conditional density of \(X\), given \(X+Y\geq 1\). Repeat the same, but condition on \(X+Y=1\).
nsim = 100000
x = rnorm(nsim)
y = rexp(nsim)
cond = x + y >= 1
x_cond = x[cond]
hist(x_cond, breaks = 100)
nsim = 100000
eps = 0.1
x = rnorm(nsim)
y = rexp(nsim)
cond = (1 - eps < x + y) & (x + y < 1 + eps)
x_cond = x[cond]
hist(x_cond, breaks = 100)
Comments (Math): In the case of conditioning on \(X+Y\geq 1\) we repeated the same procedure as in the discrete case. We simply rejected all draws that do not satisfy the condition.
When conditioning on \(X+Y=1\),
however, you immediately encounter a problem that you don’t get with
discrete distributions. The event \(\{
X+Y=1\}\) has probability \(0\)
and will never happen. That means that our strategy from the previous
problem will simply not work - you will reject all
draws. The problem goes beyond a particular approach to the problem, as
the conditional probabilities such as \({\mathbb{P}}[ Y \geq 0 | X+Y=1]\) are not
well defined. Indeed, the formula \[
{\mathbb{P}}[ Y \geq 0 | X+Y=1] "=" \frac{{\mathbb{P}}[ Y\geq
0 \text{ and } X+Y=1]}{ {\mathbb{P}}[X+Y=1]}\] requires that the
probability in the denominator be strictly positive. Otherwise you are
dividing by zero. The theoretical solution to this is by no means simple
and requires mathematics beyond the scope of these notes. Practically,
there is a very simple way of going around it. Instead of conditioning
on the zero-probability event \(X+Y=1\), we use a slightly more relaxed
condition \[ X+Y \in (1-\varepsilon,
1+\varepsilon) \] for a small, but positive, \(\varepsilon\). In many cases of interest,
this approximation works very well, as long as \(\varepsilon\) is not too big. How big?
Well, that will depend on the particular problem, as well as on the
number of simulations you are drawing. The best way is to try several
values and experiment. For example, if we chose \(\varepsilon=0.01\) in our problem, the
number of elements in x_cond (i.e., the number of
non-rejected draws) would be on the order of \(100\), which may be considered too small to
produce an accurate histogram. On the other hand, when \(\varepsilon=1\), your result will be
inaccurate because you are conditioning on the event \(0 < X+Y < 2\) which is a poor
approximation for \(X+Y=1\). The rule
of thumb is to take the smallest \(\varepsilon\) you can, while keeping the
number of non-rejected draws sufficiently large.
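Here is one way to run that experiment (the particular \(\varepsilon\) values below are my own choice): for each \(\varepsilon\), count how many draws survive the rejection step.

```r
# how many draws survive rejection for various choices of epsilon
nsim = 100000
x = rnorm(nsim)
y = rexp(nsim)
for (eps in c(1, 0.1, 0.01)) {
  kept = sum((1 - eps < x + y) & (x + y < 1 + eps))
  cat("eps =", eps, ": kept", kept, "of", nsim, "draws\n")
}
```

Halving \(\varepsilon\) roughly halves the number of surviving draws, which is exactly the trade-off described above.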
Find the Weibull distribution in R’s help system. Simulate \(n=10000\) draws from the Weibull distribution with shape parameter \(2\) and scale parameter \(3\). Draw a histogram of your simulations.
Suppose that the vector x contains \(n=10000\) simulations from the standard
normal (\(\mu=0, \sigma=1\)). Without
simulating any new random numbers, transform it into the vector
y such that y is a vector of \(n=10000\) simulations from the normal with
\(\mu=1\) and \(\sigma=0.5\). Draw histograms of both
x and y on the same plot. (Note: the
extra parameter add is used to superimpose plots. You may
want to use different colors, too. Use the parameter col
for that. )
Starting with x=seq(-3,3,by=0.1), define the
appropriate vector y and use x and
y to plot the graph of the cdf of the standard normal. The
command you want to use is plot with the following extra
arguments
type="l" (to get a smooth line instead of a bunch of points),
main="The CDF of the standard normal" (to set the title), and
ylab="F(x)" (to label the vertical axis).
The R name for the Weibull distribution is weibull
and the argument names corresponding to the shape and scale parameters
are shape and scale:
x = rweibull(10000, shape = 2, scale = 3)
hist(x)
Let \(X\) be a normally distributed random variable, with parameters \(\mu_X\) and \(\sigma_X\). When we apply a linear transformation \(Y = \alpha X + \beta\) to X, the result \(Y\) has a normal distribution again, but with different parameters. These parameters, call them \(\mu_Y\) and \(\sigma_Y\), are easily identified by taking the expected value and the variance:
\[\begin{align} \mu_Y & = {\mathbb{E}}[Y] = \alpha {\mathbb{E}}[X] + \beta = \alpha \mu_X + \beta \\ \sigma_Y^2 & = \operatorname{Var}[Y] = \operatorname{Var}[\alpha X + \beta] = \alpha^2 \operatorname{Var}[X] = \alpha^2 \sigma_X^2 \end{align}\]
In the problem we are given \(\mu_X=0\) and \(\sigma_X=1\), so we must take \(\alpha = 0.5\) and \(\beta=1\) to get \(\mu_Y=1\) and \(\sigma_Y=0.5\) (note that this is exactly the opposite of taking \(z\)-scores, where we transform a general normal into the standard normal). In R
x = rnorm(10000)
y = 0.5 * x + 1
Let’s check that the parameters of y are as
required:
(mean(y))
## [1] 1.002488
(sd(y))
## [1] 0.4989339
x = seq(-3, 3, by = 0.1)
y = pnorm(x)
plot(x, y, type = "l", ylab = "F(x)", main = "The CDF of the standard normal")
Simulate \(n=1000\) draws from the distribution whose distribution table is given by
| 2 | 4 | 8 | 16 |
|---|---|---|---|
| 0.2 | 0.3 | 0.1 | 0.4 |
Draw a histogram of your results.
You may have learned in probability how to compute the pdf \(f_Y(y)\) of a transformation \(Y=g(X)\) of a random variable with pdf \(f_X(x)\). Suppose that you forgot how to do that, but have access to \(10,000\) simulations from the distribution of \(X\). How would you get an approximate idea about the shape of the function \(f_Y\)?
More concretely, take \(X\) to be exponentially distributed with parameter \(1\) and \(g(x) = \sin(x)\) and produce a picture that approximates the pdf \(f_Y\) of \(Y\). (Note: even if you remember how to do this analytically, you will run into a difficulty. The function \(\sin(x)\) is not one-to-one and the method usually taught in probability classes will not apply. If you learned how to do it in the many-to-one case of \(g(x)= \sin(x)\), kudos to your instructor!)
Let \(X\) be a random variable with the Cauchy distribution, and \(Y = \operatorname{arctan}(X)\). R allows you to simulate from the Cauchy distribution, even if you do not know what it is. How would you use that to make an educated guess as to what the distribution of \(Y\) is? To make your life easier, consider \(\tfrac{2}{\pi} Y\) first.
x = sample(c(2, 4, 8, 16), size = 10000, prob = c(0.2, 0.3, 0.1, 0.4), replace = TRUE)
hist(x)
Note: given that we are dealing with a discrete distribution, a contingency table might be a better choice:
| 2 | 4 | 8 | 16 |
|---|---|---|---|
| 2044 | 2945 | 1045 | 3966 |
We apply the function \(\sin\) to the simulations. The histogram of the obtained values is going to be a good (graphical) approximation to the pdf of the transformed random variable:
x = rexp(100000)
y = sin(x)
hist(y)
Having learned that histograms look like the pdfs of the underlying distributions, we draw the histogram:
x = rcauchy(10000)
y = atan(x) * 2/pi
hist(y)
It looks uniform (if we replace \(10,000\) by \(100,000\) it will look even more uniform). We conclude that \(\tfrac{2}{\pi} \arctan(X)\) is probably uniformly distributed on \((-1,1)\). Hence, \(Y = \arctan(X)\) is probably uniformly distributed on \((-\pi/2, \pi/2)\).
A basic method for obtaining draws from distributions
other than the uniform is the transformation method.
The idea is to start with (pseudo) random numbers, i.e., draws from the
uniform \(U(0,1)\) distribution, and
then apply a function \(g\) to each
simulation. The difficulty is, of course, how to choose the right
function \(g\).
Let \(X\) be a random variable with a continuous
and strictly increasing cdf \(F\). What
is the distribution of \(Y=F (X)\)?
What does that have to do with the transformation method?
(Hint: if you are having difficulty with this problem, feel free to run some experiments using R. )
Let us perform an experiment where \(X \sim
N(0,1)\). Remembering that the cdf is given by the R function
pnorm:
x = rnorm(100000)
y = pnorm(x)
hist(y)
This looks like a histogram of a uniform distribution on \((0,1)\). Let’s try with some other continuous distributions
x1 = rexp(100000)
x2 = rcauchy(100000)
x3 = runif(100000)
x4 = rgamma(100000, shape = 3)
par(mfrow = c(2, 2))
hist(pexp(x1))
hist(pcauchy(x2))
hist(punif(x3))
hist(pgamma(x4, shape = 3))
All of those point to the same conjecture, namely that \(F(X)\) is uniformly distributed on \((0,1)\). To prove that, we take \(Y=F(X)\) and try to compute the cdf \(F_Y\) of \(Y\): \[F_Y(y) = {\mathbb{P}}[ Y \leq y] = {\mathbb{P}}[ F(X) \leq y]\] Since \(F\) is strictly increasing, it admits an inverse \(F^{-1}\). Moreover, for any \(y \in (0,1)\), the set of all values of \(x\) such that \(F(x)\leq y\) (the red range) is exactly the interval \((-\infty, F^{-1}(y)]\) (the blue range), as in the picture below:
Hence, \[F_Y(y)={\mathbb{P}}[Y\leq y] = {\mathbb{P}}[ F(X) \leq y] = {\mathbb{P}}[ X \leq F^{-1}(y) ] = F(F^{-1}(y)) = y, \text{ for } y\in (0,1).\] The cdf \(F_Y\) is, therefore, equal to the cdf of a uniform on \((0,1)\). Since the cdf uniquely determines the distribution, \(Y\) must be uniformly distributed on \((0,1)\).
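This also answers the second question: reading the identity backwards, if \(U\sim U(0,1)\), then \(X = F^{-1}(U)\) has cdf \(F\). That is the transformation method in a nutshell: to simulate from \(F\), apply \(g = F^{-1}\) to uniform random numbers. As a sketch (my own example, not from the text), for the exponential distribution with \(\lambda=1\) we have \(F(x) = 1-e^{-x}\), so \(F^{-1}(u) = -\log(1-u)\):

```r
# inverse-transform method: exponential draws from uniform random numbers
nsim = 100000
u = runif(nsim)
x = -log(1 - u)     # F^{-1}(u) for F(x) = 1 - exp(-x)
c(mean(x), var(x))  # both should be close to 1 for the Exp(1) distribution
hist(x, probability = TRUE, breaks = 100)
curve(exp(-x), add = TRUE, col = "red", lwd = 2)  # the Exp(1) pdf
```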
Let \(f_1\) and \(f_2\) be two pdfs. We take a constant \(\alpha \in (0,1)\) and define the function
\(f\) by \[
f(x) = \alpha f_1(x) + (1-\alpha) f_2(x).\] The function \(f\) is the pdf of a third distribution,
which is called the mixture of \(f_1\) and \(f_2\) with weights \(\alpha\) and \(1-\alpha\). Assuming that you know
how to simulate from the distributions with pdfs \(f_1\) and \(f_2\), how would you draw \(10,000\) simulations from the mixture \(f\)? Show your method on the example of a
mixture of \(N(0,1)\) and \(N(4,1)\) with \(\alpha=2/3\). Plot the histogram of the
obtained sample (play with the parameter breaks until you
get a nice picture.)
(Hint: start with two vectors, the first containing \(10,000\) simulations from \(f_1\) and the second from \(f_2\). Then “toss” \(10,000\) biased coins with \(\mathbb{P}[ H ] = \alpha\) … )
The double exponential or Laplace distribution is a continuous probability distribution whose pdf is given by \[ \tfrac{1}{2} \exp(-|x|), x\in {\mathbb R}.\] This distribution is not built into R. How would you produce simulations from the double exponential using R?
The idea is that before each draw a biased coin (with \({\mathbb{P}}[H]=\alpha\)) is tossed. If
\(H\) is obtained, we draw from the
distribution with pdf \(f_1\).
Otherwise, we draw from the distribution with pdf \(f_2\). We write a function which performs
one such simulation, and then use the command replicate to
call it several times and store the results in the vector:
single_draw = function() {
coin = sample(c(1, 2), prob = c(2/3, 1/3), size = 1, replace = TRUE)
if (coin == 1)
return(rnorm(1)) else return(rnorm(1, mean = 4, sd = 1))
}
nsim = 20000
y = replicate(nsim, single_draw())
hist(y)
As you can see, the histogram has two “humps”, one centered around \(0\) and the other centered around \(4\). The first one is taller, which reflects the higher weight (\(\alpha=2/3\)) that \(N(0,1)\) has in this mixture.
If you wanted to write a more succinct vectorized code (which is not necessarily faster in this case), you could also do something like this
nsim = 10000
alpha = 2/3
x1 = rnorm(nsim)
x2 = rnorm(nsim, mean = 4, sd = 1)
coin = sample(c(TRUE, FALSE), size = nsim, prob = c(alpha, 1 - alpha), replace = TRUE)
y = ifelse(coin, x1, x2)
The function ifelse is a vectorized version of the
if-then block and takes three arguments of equal length. The
first one is a vector of logical values c, and the other
two, x1 and x2, only need to be of the same type. The result
is a vector whose value at the position i is
x1[i] if c[i]==TRUE and x2[i]
otherwise.
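A tiny, self-contained example (the values are chosen only for illustration) shows how the elements are picked:

```r
cond = c(TRUE, FALSE, TRUE)
x1 = c(1, 2, 3)
x2 = c(10, 20, 30)
ifelse(cond, x1, x2)
## [1]  1 20  3
```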
The Laplace distribution can be understood as a mixture, with
\(\alpha=1/2\), of two distributions.
The first one is an exponential, and the second one is obtained from it
by putting a minus sign in front.
Using our strategy from part 1. above, we could get simulations of it as
follows:
nsim = 100000
alpha = 1/2
x1 = rexp(nsim)
x2 = -rexp(nsim) # note the minus sign in front of rexp
coin = sample(c(TRUE, FALSE), size = nsim, prob = c(alpha, 1 - alpha), replace = TRUE)
y = ifelse(coin, x1, x2)
hist(y)
You can do this more efficiently if you realize that every time we
toss a coin and choose between x1 and x2, we
are really choosing the sign in front of an exponentially distributed
random variable. In other words, we can use coin as a
vector of random signs for a vector of draws from the exponential
distribution:
nsim = 10000
alpha = 1/2
x = rexp(nsim)
coin = sample(c(-1, 1), size = nsim, prob = c(alpha, 1 - alpha), replace = TRUE)
y = coin * x
hist(y)
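To double-check the result, we can superimpose the Laplace pdf \(\tfrac{1}{2} e^{-|x|}\) on a density histogram of the simulations (a quick sanity check of my own):

```r
# random signs applied to exponential draws, checked against the Laplace pdf
nsim = 100000
x = rexp(nsim)
coin = sample(c(-1, 1), size = nsim, replace = TRUE)  # fair random signs
y = coin * x
hist(y, probability = TRUE, breaks = 100)
curve(0.5 * exp(-abs(x)), add = TRUE, col = "red", lwd = 2)  # Laplace pdf
```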
Let x=rnorm(1000) and y=rnorm(1000). For
each of the following pairs, use the permutation test to decide whether
they are independent or not
a. x^2+y^2 and y^2
b. (x+y)/sqrt(2) and (x-y)/sqrt(2)
c. x and 1
d. x^2+y^2 and atan(y/x)

(Note: do not worry about dividing by \(0\) in d. It will happen with probability \(0\).)
Let us start by writing a function to save some keystrokes
permutation_test = function(z, w) {
par(mfrow = c(2, 2))
plot(z, w, asp = 1)
plot(z, sample(w), asp = 1)
plot(z, sample(w), asp = 1)
plot(z, sample(w), asp = 1)
}
x = rnorm(1000)
y = rnorm(1000)
permutation_test(x^2 + y^2, y^2)
The first plot is very different from the other three. Therefore, the vectors are probably not independent.
permutation_test((x + y)/sqrt(2), (x - y)/sqrt(2))
The first plot could easily be confused for one of the other three. Therefore the vectors are probably independent.
# we have to use rep(1,length(x)) to get a vector of 1s of the same length as
# x. R will not recycle it properly if you simply write 1. Another, more
# 'hacky' way would be to take advantage of recycling and use 0*x+1
permutation_test(x, rep(1, length(x)))
The plots look very similar. Therefore, the vectors are probably independent. We could have known this without drawing any graphs. Anything is independent of a constant random variable (vector).
permutation_test(x^2 + y^2, atan(y/x))
The plots look very similar to each other. Therefore, z and
w are probably independent.
Note: The plots in b) and d) reveal that the distribution of the random vector \((X,Y)\) consisting of two independent standard normals is probably rotation invariant. In b) we are asked to compare the coordinates of the vector obtained from \((X,Y)\) by a rotation by \(45\) degrees around the origin. The fact that independence persisted suggests that the components remain independent even after a (specific) rotation. If you tried rotations by different angles you would get the same result. The experiment in d) told us that the (squared) distance \(X^2+Y^2\) and the angle between \((X,Y)\) and the \(x\)-axis are independent. This is also something that one would expect from a rotationally-invariant distribution. Indeed, the distribution of the distance to the origin should not depend on the direction.
It is important to note that none of this proves anything. It is simply numerical evidence for a given conclusion.
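If you want something more quantitative than an eyeball comparison (this goes beyond what the problem asks), you can compare the observed correlation to the correlations of many permuted copies; an observed value far outside the permuted range is strong numerical evidence against independence. Keep in mind that correlation only detects linear dependence, so this version of the test would miss a case like d):

```r
# permutation test based on correlation (detects only linear dependence)
x = rnorm(1000)
y = rnorm(1000)
z = x^2 + y^2
w = y^2
perm_cors = replicate(1000, cor(z, sample(w)))  # correlations under independence
c(observed = cor(z, w), perm_lo = min(perm_cors), perm_hi = max(perm_cors))
```

For pair a. the observed correlation lands far above anything produced by permutation, matching our visual conclusion.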
Simulate \(n=10000\) draws from the joint distribution given by the following table:
| 1 | 2 | 3 | |
|---|---|---|---|
| 1 | 0.1 | 0.0 | 0.3 |
| 2 | 0.1 | 0.1 | 0.0 |
| 3 | 0.0 | 0.0 | 0.4 |
Display the contingency table of your results, as well as a table showing the “errors”, i.e., differences between the theoretical frequencies (i.e., probabilities given above) and the obtained relative frequencies in the sample.
We are using the procedure from Section 2.2 in the notes.
nsim = 10000
joint_distribution_long = data.frame(
x = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
y = c(1, 2, 3, 1, 2, 3, 1, 2, 3)
)
probabilities_long =
c(0.1, 0.0, 0.3,
0.1, 0.1, 0.0,
0.0, 0.0, 0.4)
sampled_rows = sample(
x = 1:nrow(joint_distribution_long),
size = nsim,
replace = TRUE,
prob = probabilities_long
)
draws = joint_distribution_long[sampled_rows,]
(rel_freq = prop.table(table(draws)))
## y
## x 1 2 3
## 1 0.0998 0.0000 0.3017
## 2 0.0993 0.1005 0.0000
## 3 0.0000 0.0000 0.3987
(th_freq = matrix(probabilities_long, byrow = TRUE, nrow = 3))
## [,1] [,2] [,3]
## [1,] 0.1 0.0 0.3
## [2,] 0.1 0.1 0.0
## [3,] 0.0 0.0 0.4
(err = rel_freq - th_freq)
## y
## x 1 2 3
## 1 -0.0002 0.0000 0.0017
## 2 -0.0007 0.0005 0.0000
## 3 0.0000 0.0000 -0.0013
Estimate the following integrals using Monte Carlo
\(\int_0^1 \cos(x)\, dx\)
\(\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}\frac{e^{-x^2/2}}{1+x^4}\,dx\)
\(\int_0^{\infty} e^{-x^3-x}\, dx\)
\(\int_{-\infty}^{\infty} \frac{\cos(x^2)}{1+x^2}\, dx\) (extra credit)
The idea here is to use the “fundamental theorem of statistics” \[ {\mathbb{E}}[ g(X) ] = \int g(x)\, f_X(x)\, dx \] where \(f_X\) is the pdf of \(X\) and \(g\) is any reasonably well-behaved function. Normally, one would use the integral on the right to compute the expectation on the left. We are flipping the logic, and using the expectation (which we can approximate via Monte Carlo) to estimate the integral on the right.
We pick \(g(x) = \cos(x)\) and \(X\) a r.v. with a uniform distribution on \((0,1)\), so that \(f_X(x) = 1\) for \(x\in (0,1)\) and \(0\) otherwise:
nsim = 10000
x = runif(nsim)
y = cos(x)
mean(y)
## [1] 0.8426696
For comparison, the exact value of the integral is \(\sin(1) \approx 0.841471\).
We cannot use the uniform distribution anymore, because the limits of integration are \(\pm \infty\). Part of the expression inside the integral can be recognized as a (standard) normal density, so we take \(X \sim N(0,1)\) and \(g(x) = 1/(1+x^4)\)
nsim = 10000
x = rnorm(nsim)
y = 1/(1 + x^4)
mean(y)
## [1] 0.6786274
The “exact” value (i.e., very precise approximation to this integral obtained using another numerical method) is \(0.676763\).
We integrate \(g(x) = \exp(-x^3)\) against the exponential pdf \(f_X(x) = \exp(-x)\), for \(x>0\):
nsim = 10000
x = rexp(nsim)
y = exp(-x^3)
mean(y)
## [1] 0.5680925
A close approximation of the true value is \(0.56889\).
In this case, a possible choice of the distribution for \(X\) is the Cauchy distribution (no worries if you never heard about it), whose pdf is \(f_X(x) = \frac{1}{\pi(1+x^2)}\), so that \(g(x) = \pi \cos(x^2)\):
nsim = 10000
x = rcauchy(nsim)
y = pi * cos(x^2)
mean(y)
## [1] 1.320216
The “exact” value is \(1.30561\).
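The error here is noticeably larger than in the previous parts, and that is no accident: the draws \(\pi \cos(X^2)\) have a much larger variance. We can quantify the accuracy of any of these Monte Carlo estimates by the standard error \(\operatorname{sd}(y)/\sqrt{n}\) (a standard diagnostic, added here by me):

```r
# Monte Carlo estimate together with its standard error
nsim = 10000
x = rcauchy(nsim)
y = pi * cos(x^2)
c(estimate = mean(y), std_error = sd(y)/sqrt(nsim))
```

The true value typically lies within two or three standard errors of the estimate.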
The tricylinder is a solid body constructed as follows: create three cylinders of radius 1 around each of the three coordinate axes and intersect them:
Use Monte Carlo to estimate the volume of the tricylinder and check your estimate against the exact value \(8(2-\sqrt{2})\).
By the very construction, it is clear that the entire tricylinder lies within the cube \([-1,1]\times [-1,1] \times[-1,1]\). Therefore, we can compute its volume by simulating random draws from the uniform distribution in that cube, and computing the relative frequency of those values that fall inside the tricylinder. The whole point is that it is easy to check, given a point \((x,y,z)\), whether it lies inside the tricylinder or not. Indeed, the answer is “yes” if and only if all three of the following inequalities are satisfied: \[ x^2+y^2 \le 1,\ x^2+z^2\leq 1 \text{ and } y^2+z^2\leq 1.\]
nsim = 10000
x = runif(nsim, min = -1, max = 1)
y = runif(nsim, min = -1, max = 1)
z = runif(nsim, min = -1, max = 1)
is_in = (x^2 + y^2 <= 1) & (x^2 + z^2 <= 1) & (y^2 + z^2 <= 1)
(2^3 * sum(is_in)/nsim)
## [1] 4.708
We multiplied by \(2^3\) because that is the volume of the cube \([-1,1]\times [-1,1] \times [-1,1]\). Without it, we would get the portion of the cube taken by the tricylinder, and not its volume.
The true value of \(8(2-\sqrt{2})\) is, approximately, \(4.6862\).
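Since each element of is_in is a Bernoulli indicator with success probability \(p = V/8\), we can also attach a rough confidence interval to the estimate (an addition of mine, using the usual \(\pm 2\) standard errors):

```r
# Monte Carlo volume estimate with a rough 2-standard-error interval
nsim = 10000
x = runif(nsim, min = -1, max = 1)
y = runif(nsim, min = -1, max = 1)
z = runif(nsim, min = -1, max = 1)
is_in = (x^2 + y^2 <= 1) & (x^2 + z^2 <= 1) & (y^2 + z^2 <= 1)
p_hat = mean(is_in)
vol = 8 * p_hat
se = 8 * sqrt(p_hat * (1 - p_hat)/nsim)  # standard error of the volume estimate
c(estimate = vol, lower = vol - 2 * se, upper = vol + 2 * se)
```

The exact value \(8(2-\sqrt{2}) \approx 4.6862\) should fall inside the interval most of the time.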
Read about the Monty Hall Problem online (the introduction to its Wikipedia page has a nice description), Use Monte Carlo to compare the two possible strategies (switching and not-switching) and decide which is better.
The host knows where the car is and what the contestant’s guess is. If
those two are the same (i.e., the contestant guessed right), he will choose
one of the two remaining doors at random. If not, he simply shows the
contestant the other door with the goat behind it. This is exactly what the
function show_door implements:
show_door = function(car, guess) {
all_doors = c(1, 2, 3)
goat_doors = all_doors[all_doors != car]
if (car == guess) {
random_goat_door = sample(goat_doors, size = 1)
return(random_goat_door)
} else {
the_other_goat_door = goat_doors[goat_doors != guess]
return(the_other_goat_door)
}
}
Next, we write a function which simulates the outcome of a single
game. It will have one argument, switch which will
determine whether the contestant switches the door or not.
one_game = function(switch) {
all_doors = c(1, 2, 3)
car = sample(all_doors, size = 1)
guess = sample(all_doors, size = 1)
if (switch) {
unguessed_doors = all_doors[all_doors != guess]
shown_door = show_door(car, guess)
switched_guess = unguessed_doors[unguessed_doors != shown_door]
return(switched_guess == car)
} else {
return(guess == car)
}
}
Finally we run two batches of \(10,000\) simulations, one with
switch=TRUE and another with switch=FALSE:
nsim = 10000
switch_doors = replicate(nsim, one_game(TRUE))
dont_switch_doors = replicate(nsim, one_game(FALSE))
(prob_with_switching = mean(switch_doors))
## [1] 0.668
(prob_without_switching = mean(dont_switch_doors))
## [1] 0.3288
Therefore, the probability of winning after switching is about double the probability of winning without switching. Switching is good for you!
(A philosophical note: this was the most “agnostic” approach to this
simulation. Simulations can often be simplified with a bit of insight.
For example, we could have realized that the switching strategy simply
flips the correctness of the guess (from “correct” to “wrong” and vice
versa) and used it to write a much shorter answer. Ultimately, we could
have realized that, because the probability of the initial guess being
correct is \(1/3\), switching leads to
a correct guess in \(2/3\) of the cases
(and not switching in only \(1/3\) of
the cases). In this case, the whole code would be
sample(c("correct", "incorrect"), size=10000, prob=c(2/3,1/3), replace=TRUE),
which is an extremely inefficient way to estimate the value of the
number \(2/3\)!)
Find the beta distribution in R’s help system. The only thing you
need to know about it is that it comes with two positive parameters
\(\alpha\) and \(\beta\) (called shape1 and
shape2 in R). Simulate \(n=10000\) draws from the beta distribution
with parameters \(\alpha=0.5\) and
\(\beta=0.3\). Plot a density histogram
of your simulations as well as the pdf of the underlying distribution on
the same plot. (Note: if you use the command lines instead
of plot for the pdf, it will be automatically added to your
previous plot - the histogram, in this case).
Let \(X\) be a beta-distributed random variable with parameters \(\alpha=0.5\) and \(\beta=0.3\) (as above), and let \(Y\) be an independent random variable with the same distribution. Estimate the probability \({\mathbb{P}}[ XY > 1+ \log(Y)]\). Check graphically whether \(XY\) and \(1+\log(Y)\) are independent random variables.
Let \(X\), \(Y\) and \(Z\) be three independent random variables uniformly distributed on \((0,1)\), and let \(M = \min(X,Y,Z)\). A theorem in probability states that \(M\) has a beta distribution with some parameters \(\alpha\) and \(\beta\). Use simulation (and associated plots) to make an educated guess about what the values of \(\alpha\) and \(\beta\) are. (Hint: they are nice round numbers.)
x = rbeta(10000, shape1 = 0.5, shape2 = 0.3)
hist(x, probability = T)
x_pdf = seq(0, 1, by = 0.01)
y_pdf = dbeta(x_pdf, 0.5, 0.3)
lines(x_pdf, y_pdf, type = "l", col = "red", lwd = 2)
We use Monte Carlo to estimate the probability:
x = rbeta(10000, shape1 = 0.5, shape2 = 0.3)
y = rbeta(10000, shape1 = 0.5, shape2 = 0.3)
z = x * y > 1 + log(y)
mean(z)
## [1] 0.4457
and the permutation test to check for (in)dependence
w = x * y
z = 1 + log(y)
par(mfrow = c(2, 2))
z_perm_1 = sample(z)
z_perm_2 = sample(z)
z_perm_3 = sample(z)
plot(w, z)
plot(w, z_perm_1)
plot(w, z_perm_2)
plot(w, z_perm_3)
Since the first plot looks very different from the other three, we conclude that \(XY\) and \(1+\log(Y)\) are most likely not independent.
First, we simulate \(10,000\)
draws from the distribution of \(M\).
The function min is not vectorized (and it should not be!),
so we cannot simply write \(m =
min(x,y,z)\). Luckily there is another function, called
pmin (where p stands for parallel) which
returns the component-wise min. Alternatively, we could use the function
apply to apply min to each row of the data
frame df which contains the simulations of \(X,Y\) and \(Z\):
nsim = 10000
x = runif(nsim)
y = runif(nsim)
z = runif(nsim)
m = pmin(x, y, z)
# or, alternatively,
df = data.frame(x, y, z)
m = apply(df, 1, min)
The idea is to compare the histogram of (the simulations) of \(M\) to pdfs of the beta distributions for
various values of parameters and see what fits best. The function
try_parameters below does exactly that - it superimposes the
beta pdf onto the histogram of m (just like in part 1.
above). We try a few different values, and finally settle on \(\alpha=1\) and \(\beta=3\), because it seems to fit well. It
turns out that \(\alpha=1\) and \(\beta=3\) is, indeed, correct.
try_parameters = function(alpha, beta) {
hist(m, probability = T, main = paste("Trying alpha = ", alpha, " beta = ", beta))
x_pdf = seq(0, 1, by = 0.01)
y_pdf = dbeta(x_pdf, alpha, beta)
lines(x_pdf, y_pdf, type = "l", col = "red", lwd = 2)
}
par(mfrow = c(3, 2))
try_parameters(3, 2)
try_parameters(2, 2)
try_parameters(1, 0.5)
try_parameters(0.5, 2)
try_parameters(0.5, 3)
try_parameters(1, 3)
Alternatively, you could have looked up the mean and the variance of the beta distribution online, and obtained the following expressions: \[ \text{ mean}= \frac{\alpha}{\alpha+\beta}, \text{ variance} = \frac{\alpha\beta}{(\alpha+\beta)^2 (\alpha+\beta+1)},\] and then tried to find \(\alpha\) and \(\beta\) to match the estimated mean and variance
c(mean(m), var(m))
## [1] 0.24952292 0.03719576
The first equation tells us that \(\alpha/(\alpha+\beta)\) is about \(1/4\), i.e., \(\beta \approx 3 \alpha\). We plug the obtained value of \(\beta\) into the equation for the variance to get the following equation: \[ 0.0372 \approx \frac{ 3 \alpha^2}{ (4\alpha)^2 (4 \alpha+1)} = \frac{3}{16\,(4\alpha+1)},\] so that \[ 4 \alpha + 1 \approx \frac{3}{ 16\times 0.0372} \approx 5.04, \text{ i.e., } \alpha \approx 1 \text{ and } \beta \approx 3.\] This estimation technique - where the mean and variance (and possibly higher moments) are computed and matched to parameters - is called the method of moments.
In case you are curious, here is how to derive the theorem from the problem (and check that our guess is, indeed, correct). We note that for any \(x\in (0,1)\), \(M> x\) if and only if \(X>x\), \(Y>x\) and \(Z>x\). Therefore, by independence of \(X,Y\) and \(Z\), we have \[ {\mathbb{P}}[ M > x ] = {\mathbb{P}}[ X>x, Y>x, Z>x] = {\mathbb{P}}[X>x] \times {\mathbb{P}}[Y>x] \times {\mathbb{P}}[Z>x] = (1-x)^3.\] From there, we conclude that the cdf of \(M\) is given by \[F(x) = {\mathbb{P}}[M\leq x] = 1- {\mathbb{P}}[M>x] = 1 - (1-x)^3.\] Since \(F\) is a smooth function, we can differentiate it to get the pdf: \[ f(x) = F'(x) = 3 (1-x)^2,\] which is exactly the pdf of the beta distribution with parameters \(\alpha=1\) and \(\beta=3\) (see the Wikipedia page of the beta distribution for this and many other facts).
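We can also check the derived cdf \(F(x) = 1-(1-x)^3\) directly against simulations at a few (arbitrarily chosen) test points:

```r
# compare the empirical cdf of M to the formula 1 - (1-x)^3
nsim = 10000
x = runif(nsim)
y = runif(nsim)
z = runif(nsim)
m = pmin(x, y, z)
for (t in c(0.1, 0.25, 0.5)) {
  cat("t =", t, " empirical:", mean(m <= t), " theoretical:", 1 - (1 - t)^3, "\n")
}
```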
Find the gamma distribution in R’s help system. The only thing
you need to know about it is that it comes with two positive parameters
\(k\) and \(\theta\) (called shape and
scale in R). Simulate \(n=10000\) draws from the gamma distribution
with parameters \(k=2\) and \(\theta=1\). Plot a density histogram of
your simulations as well as the pdf of the underlying distribution on
the same plot. Repeat for \(3\) more
choices (left to you) of the parameters \(k\) and \(\theta\).
Let \(X\) be a gamma-distributed random variable with parameters \(k=2\) and \(\theta=1\) (as above), and let \(Y_1, Y_2\) be a independent standard normals (also independent of \(X\)). Estimate the probability \({\mathbb P}[ X > Y_1^2 + Y_2^2 ]\) using Monte Carlo. Check graphically whether \(XY_1\) and \(X Y_2\) are independent.
Let \(X\), \(Y\) and \(Z\) be three independent random variables with the gamma distribution and the scale parameter \(\theta=1\). Their shape parameters are, however different, and equal to \(1,2\) and \(3\), respectively. Use simulation (and associated plots) to make an educated guess about the distribution of \(X+Y+Z\) and its parameters. (Hint: the parameters will be nice round numbers.)
hist_with_pdf <- function(k, theta) {
x = rgamma(10000, shape = k, scale = theta)
hist(x,
probability = T,
breaks = 50,
xlim = c(0,15), ylim = c(0, 0.8),
main = paste("shape = ", k, ", scale = ", theta)
)
x_pdf = seq(0, 15, by = 0.01)
y_pdf = dgamma(x_pdf, shape = k, scale = theta)
lines(x_pdf, y_pdf,
type = "l",
col = "red",
lwd = 2)
}
par(mfrow = c(2, 2))
hist_with_pdf(2,1)
hist_with_pdf(2,0.5)
hist_with_pdf(3,1)
hist_with_pdf(3,0.5)
We use Monte Carlo to estimate the probability:
x = rgamma(10000, shape = 2, scale = 1)
y1 = rnorm(10000)
y2 = rnorm(10000)
z = x > y1^2 + y2^2
mean(z)
## [1] 0.5508
and the permutation test to check for (in)dependence:
w = x * y1
z = x * y2
par(mfrow = c(2, 2))
z_perm_1 = sample(z)
z_perm_2 = sample(z)
z_perm_3 = sample(z)
plot(w, z)
plot(w, z_perm_1)
plot(w, z_perm_2)
plot(w, z_perm_3)
The common shape of plots 2, 3, and 4 is sufficiently different from the shape of plot 1 to conclude that \(XY_1\) and \(XY_2\) are most likely not independent. Note: if you were to compute the correlation between \(XY_1\) and \(XY_2\) you would get (approximately) \(0\) - this is an example of uncorrelated random variables that are, nevertheless, not independent.
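If you would like to check the claim about the correlation numerically, you can recompute it from fresh simulations (the sample correlation will not be exactly \(0\), but it should be quite small):

```r
# fresh draws of X, Y1 and Y2, as in the problem
x = rgamma(10000, shape = 2, scale = 1)
y1 = rnorm(10000)
y2 = rnorm(10000)
# sample correlation of X*Y1 and X*Y2 - close to 0, despite the dependence
cor(x * y1, x * y2)
```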
First, we simulate \(10,000\) draws from the distribution of \(S = X + Y + Z\) and plot the histogram of obtained values:
nsim = 10000
x = rgamma(nsim, shape = 1, scale = 1)
y = rgamma(nsim, shape = 2, scale = 1)
z = rgamma(nsim, shape = 3, scale = 1)
s = x + y + z
hist(s, breaks = 100)
The shape resembles the shape we obtained in 1. above, so we try the gamma distribution with various parameters. The function try_parameters below does exactly that - it superimposes the gamma pdf onto the histogram of s.
try_parameters = function(k, theta) {
hist(s, probability = T, breaks = 50, main = paste("Trying shape = ", k, " scale = ",
theta))
x_pdf = seq(0, 15, by = 0.01)
y_pdf = dgamma(x_pdf, shape = k, scale = theta)
lines(x_pdf, y_pdf, type = "l", col = "red", lwd = 2)
}
par(mfrow = c(2, 3))
try_parameters(1, 1)
try_parameters(2, 1)
try_parameters(3, 1)
try_parameters(4, 1)
try_parameters(5, 1)
try_parameters(6, 1)
We conclude that \(X+Y+Z\) is most likely gamma-distributed with \(k=6\) and \(\theta=1\). This is, indeed, the case: the sum of independent gammas with shape parameters \(k_1, \dots, k_n\) and the same scale parameter \(\theta\) is a gamma with parameters \(k = k_1+\dots+k_n\) and \(\theta\).
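As an additional sanity check (not required by the problem), recall that a gamma distribution with shape \(k\) and scale \(\theta\) has mean \(k\theta\) and variance \(k\theta^2\). If our guess \(k=6\), \(\theta=1\) is right, both the sample mean and the sample variance of s should be close to \(6\):

```r
nsim = 10000
s = rgamma(nsim, shape = 1, scale = 1) +
    rgamma(nsim, shape = 2, scale = 1) +
    rgamma(nsim, shape = 3, scale = 1)
c(mean(s), var(s))  # both close to 6 if the guess is correct
```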
We learned how to simulate from the joint distribution of a discrete random vector \((X,Y)\) by thinking of it as a one-dimensional distribution whose values are represented by pairs of numbers. Here is another way this can be done:
Find the marginal distribution of one of them, say \(X\), and simulate from it
Given the value you just obtained, let’s call it \(x\), simulate from the conditional distribution of \(Y\), given \(X=x\).
Use this method to draw \(10000\) simulations from the joint distribution given by the following table (columns correspond to the values of \(X\), rows to the values of \(Y\)):

| | 1 | 2 | 3 |
|---|---|---|---|
| 1 | 0.1 | 0.0 | 0.3 |
| 2 | 0.1 | 0.1 | 0.0 |
| 3 | 0.0 | 0.0 | 0.4 |
Display the contingency table of your simulations, first using counts, and then using relative frequencies. Compare to the theoretical values (i.e., the probabilities in the table above).
margin_X = c(0.2, 0.1, 0.7 )
cond_Y_X = matrix(
c( 0.5, 0.0, 3/7,
0.5, 1.0, 0.0,
0.0, 0.0, 4/7),
byrow=TRUE,
nrow=3)
single_draw = function() {
x = sample(c(1,2,3), size=1, prob=margin_X)
y = sample(c(1,2,3), size=1, prob=cond_Y_X[,x])
return(c(x,y))
}
nsim=10000
df = data.frame(
t(replicate(nsim, single_draw()))
)
colnames(df) = c("x","y")
t(table(df))
## x
## y 1 2 3
## 1 962 0 2963
## 2 1012 1023 0
## 3 0 0 4040
The variables margin_X and cond_Y_X are what you get when you compute the marginal and the conditional distribution from the given joint-distribution table, as you did in your probability class.
The function single_draw performs a single draw from the distribution of \((X,Y)\) by first drawing the value of \(X\) from its marginal distribution. Then, it picks the column of the conditional-distribution table corresponding to the obtained value of \(X\) and simulates from it.
The function replicate is used to repeat
single_draw many times and collect the results. By default,
replicate attaches the output of each new “replication” as
a new column and not a row, so we need to transpose the final product.
That is what the function t() is for. We turn the result
into a data frame because the function table knows how to
handle data frames automatically. Another use of the transpose gives
x the horizontal axis, and y the vertical one,
like in the statement of the problem.
Let \((Z,W)\) be a discrete random vector with the following joint distribution table (different rows correspond to different values of \(Z\)):
| | -1 | 0 | 1 |
|---|---|---|---|
| -1 | 0.2 | 0.1 | 0.0 |
| 0 | 0.1 | 0.2 | 0.1 |
| 1 | 0.0 | 0.1 | 0.2 |
and set \[ X = Z+W \text{ and } Y=Z W.\]
Find the distribution table for \((X,Y)\) and compute \({\mathbb{E}}[ X Y]\) and \({\mathbb{P}}[X=0 | Y=0]\) analytically (no simulations!). You can do this part on a separate piece of paper or inside your Rmd file if you know LaTeX.
Draw \(10000\) simulations of \((X,Y)\) (don’t print them out). Display the contingency table of your simulations using relative frequencies, as well as the table of errors (differences between your contingency table and the table of probabilities).
Compute \({\mathbb{E}}[X Y]\) again, but this time using your simulations from 1. above.
Draw simulations from the conditional distribution of \(X\), given \(Y=0\). Use them to estimate the conditional probability \({\mathbb{P}}[ X=0 | Y=0]\). By how much does it differ from your analytical result from 1. above?
| | 0 | 1 |
|---|---|---|
| -2 | 0.0 | 0.2 |
| -1 | 0.2 | 0.0 |
| 0 | 0.2 | 0.0 |
| 1 | 0.2 | 0.0 |
| 2 | 0.0 | 0.2 |
To compute \({\mathbb{E}}[XY]\) we compute the value of the product \(ij\) for each entry in the table above, multiply it by the probability there, and sum the obtained results. The only non-zero terms are \(1 \times (-2) \times 0.2\) and \(1 \times 2 \times 0.2\), so \[ {\mathbb{E}}[ XY ] = -2 \times 0.2 + 2 \times 0.2 = 0.\] Finally, \[ {\mathbb{P}}[ X=0 | Y=0] = \frac{{\mathbb{P}}[ X=0, Y=0]}{ {\mathbb{P}}[ Y=0]} = \frac{ 0.2}{ 0.2 + 0.2 + 0.2} = \frac{1}{3}.\]
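The same two computations can be carried out in R, directly from the joint-distribution table of \((X,Y)\) derived above (the variable names are mine):

```r
x_vals = c(-2, -1, 0, 1, 2)
y_vals = c(0, 1)
# the joint-distribution table of (X,Y): rows indexed by x, columns by y
probs = matrix(c(0.0, 0.2,
                 0.2, 0.0,
                 0.2, 0.0,
                 0.2, 0.0,
                 0.0, 0.2), byrow = TRUE, nrow = 5)
# E[XY]: sum of x*y*p over all cells of the table
sum(outer(x_vals, y_vals) * probs)                            # 0
# P[X=0 | Y=0] = P[X=0, Y=0] / P[Y=0]
probs[x_vals == 0, y_vals == 0]/sum(probs[, y_vals == 0])     # 1/3
```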
We borrow the code from the notes to draw simulations from the distribution of \((Z,W)\) first:
joint_distribution_long = data.frame(
z = c(-1,-1,-1, 0,0,0, 1,1,1),
w = c(-1,0,1, -1,0,1, -1,0,1) )
probabilities_long = c(0.2,0.1,0.0,0.1,0.2,0.1,0.0,0.1,0.2);
sampled_rows = sample(
x = 1:nrow(joint_distribution_long),
size = 10000,
replace = TRUE,
prob = probabilities_long )
d_zw = joint_distribution_long[sampled_rows, ]
Then we transform the results into \((X,Y)\) and output the table of relative frequencies:
draws_xy = data.frame(x = d_zw$z + d_zw$w, y = d_zw$z * d_zw$w)
(cont_table = prop.table(table(draws_xy)))
## y
## x 0 1
## -2 0.0000 0.2059
## -1 0.1993 0.0000
## 0 0.1987 0.0000
## 1 0.1940 0.0000
## 2 0.0000 0.2021
We build a matrix containing the probabilities we derived in 1. above
and subtract it from cont_table:
prob_table = matrix(c(0, 0.2, 0.2, 0.2, 0, 0.2, 0, 0, 0, 0.2), nrow = 5)
(error_table = cont_table - prob_table)
## y
## x 0 1
## -2 0.0000 0.0059
## -1 -0.0007 0.0000
## 0 -0.0013 0.0000
## 1 -0.0060 0.0000
## 2 0.0000 0.0021

Monte Carlo:
mean(draws_xy$x * draws_xy$y)
## [1] -0.0076

First we remove all rows from draws_xy whose \(y\)-component is not \(0\):
draws_cond = draws_xy[draws_xy$y == 0, ]
Then we compute the relative frequency of the remaining draws where \(x=0\):
(est_prob = mean(draws_cond$x == 0))
## [1] 0.3356419
The true value is \(1/3\), so the error is given by
est_prob - 1/3
## [1] 0.002308559

Harry Potter’s cousin Nigel Potter owns a magical set of three dice. They behave just like any three (fair, independent) 6-sided dice, except for the fact that, when thrown together, they always show three distinct numbers. In other words, the outcomes with repeating numbers never happen, while all combinations of three distinct numbers are equally likely.
Create a large (\(\ge 10,000\)) set of simulations of throws of these magical dice. Use any method you like.
Output the contingency table for the outcomes of the first two dice, and compare to theoretical probabilities.
Draw samples from the conditional distribution of the outcome of the third die, given that the sum on the first two is \(6\). Display the contingency table for your outcomes.
(Extra credit) Find the (theoretical) conditional distribution of the third die given that the sum on the first two is \(6\), and compare it to your result from 3. above.
Here are three different methods you can use to do this problem:
Sampling without replacement:
The three magical dice sample without replacement from the set \(\{1,2,3,4,5,6\}\), so we can simply do the following:
nsim = 10000
sims = data.frame(t(replicate(nsim, sample(c(1, 2, 3, 4, 5, 6), size = 3, replace = FALSE))))
colnames(sims) = c("x", "y", "z")
The rejection method:
We simulate three regular dice, and then simply reject all outcomes where two of the numbers are the same:
nsim = 25000
x0 = sample(c(1, 2, 3, 4, 5, 6), size = nsim, replace = TRUE)
y0 = sample(c(1, 2, 3, 4, 5, 6), size = nsim, replace = TRUE)
z0 = sample(c(1, 2, 3, 4, 5, 6), size = nsim, replace = TRUE)
good = (x0 != y0) & (y0 != z0) & (x0 != z0)
sims = data.frame(x = x0[good], y = y0[good], z = z0[good])
nsim = dim(sims)[1] # we rejected a random number of draws
Using sample:
We need to make the list of all allowed combinations (no repeats) and
then sample the entire triplet of dice at once. I am using
rbind to append a row to a data frame, but you can do this
in many other ways.
nsim = 100000
df = data.frame()
for (i in 1:6) {
for (j in 1:6) {
for (k in 1:6) {
if ((i != j) & (j != k) & (i != k)) {
df = rbind(df, c(i, j, k))
}
}
}
}
colnames(df) = c("x", "y", "z")
rows = as.numeric(row.names(df)) # as.numeric needed to turn strings into integers
sampled_rows = sample(rows, nsim, replace = TRUE)
sims = df[sampled_rows, ]

The probability of seeing a pair \((i,j)\) is \(0\) if \(i=j\) and \(1/30\) otherwise (where \(30\) is the number of ordered pairs of distinct numbers).
table_sim = table(sims$x, sims$y)/nsim
table_th = matrix(1/30, ncol = 6, nrow = 6)
for (i in 1:6) table_th[i, i] = 0
(error = table_sim - table_th)
##
## 1 2 3 4 5 6
## 1 0.000000 0.000267 0.000177 0.000047 -0.000653 0.001067
## 2 -0.000073 0.000000 -0.000623 0.000927 -0.000743 0.000207
## 3 -0.000843 0.000217 0.000000 -0.000233 -0.000543 0.000777
## 4 0.000017 0.000637 -0.000473 0.000000 0.000557 -0.000453
## 5 0.000017 -0.000373 -0.000343 0.000277 0.000000 0.000077
## 6 -0.000293 0.001307 -0.000783 -0.000023 -0.000113 0.000000

good = (sims$x + sims$y == 6)
z_cond = sims$z[good]
(table_sim = table(z_cond)/length(z_cond))
## z_cond
## 1 2 3 4 5 6
## 0.13 0.12 0.25 0.12 0.12 0.25

We can reuse the code from above (simulation using
sample) to list all possible elementary outcomes with \(x+y=6\) and then count the frequencies of
different values of the third die in this list (or we could simply do
that on a piece of paper):
df = data.frame()
for (i in 1:6) {
for (j in 1:6) {
for (k in 1:6) {
if ((i != j) & (j != k) & (i != k) & (i + j == 6)) {
df = rbind(df, c(i, j, k))
}
}
}
}
colnames(df) = c("x", "y", "z")
(table_th = table(df$z)/length(df$z))
##
## 1 2 3 4 5 6
## 0.12 0.12 0.25 0.12 0.12 0.25
The error is given by:
options(digits = 2)
(error = table_sim - table_th)
## z_cond
## 1 2 3 4 5 6
## 0.00520 -0.00017 -0.00279 -0.00255 -0.00173 0.00205

Exactly one percent of the people in a given population have a certain disease. The accuracy of the diagnostic test for it is such that it detects the sick as sick with probability \(0.95\) and the healthy as healthy with probability \(0.9\). A person chosen at random from the population tested positive. What is the probability that he or she is, in fact, sick? Do the problem both analytically and by Monte Carlo.
The person can test positive (denoted by \(tS\) in the plot) in two ways: by actually being sick (\(S\)) and then testing positive, or by being healthy and then testing positive. Bayes’ formula (or simply a look at the picture above) gives us \[ {\mathbb{P}}[ S | tS ] = \frac{ 0.01 \times 0.95}{ 0.01\times 0.95 + 0.1\times 0.99 } \approx 0.088.\] Thus, even though the test is quite accurate, most positive results are false positives - the probability that a person who tested positive is actually sick is below \(9\%\).
Let us do the same via Monte Carlo. We proceed like this. First we
“pick a person” from the population by sampling from
c("H", "S") and then “test” this person. After repeating
this nsim times, we condition on the positive test, by
removing all draws where the test was negative. This leaves us with a
population of people who tested positive, and we simply need to see what
proportion of those are actually sick.
single_draw = function() {
x = sample(c("H", "S"), size = 1, prob = c(0.99, 0.01))
if (x == "H") {
y = sample(c("tH", "tS"), size = 1, prob = c(0.9, 0.1))
} else {
y = sample(c("tH", "tS"), size = 1, prob = c(0.05, 0.95))
}
return(c(x, y))
}
nsim = 100000
df = data.frame(t(replicate(nsim, single_draw())))
colnames(df) = c("status", "test_result")
cond = (df$test_result == "tS")
df_cond = df[cond, ]
(prob = mean(df_cond$status == "S"))
## [1] 0.088
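As a side note, the replicate-based simulation above can be slow for large nsim; the whole experiment can also be vectorized. This is just an alternative sketch, equivalent in distribution to the code above:

```r
nsim = 100000
# health status: TRUE means sick, which happens with probability 0.01
sick = runif(nsim) < 0.01
# sick people test positive with probability 0.95, healthy ones with 0.1
positive = ifelse(sick, runif(nsim) < 0.95, runif(nsim) < 0.1)
# the proportion of sick people among those who tested positive
mean(sick[positive])
```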
A point is chosen at random, uniformly in the unit cube \([0,1]\times [0,1]\times [0,1]\). Its distance to the origin \((0,0,0)\) is measured, and turns out to be equal to \(1.5\).
Use simulations to estimate the shape of the pdf of the conditional distribution of the point’s distance to \((1,1,1)\). Compare it to the unconditional case, i.e., the case where no information about the distance to \((0,0,0)\) is known.
Compute the mean of this (conditional) distribution for a few values of the parameter \(\varepsilon\) you use to deal with conditioning in the continuous case. Make sure you include values of \(\varepsilon\) at both ends of the spectrum: too big and too small.
We want to vary the parameter eps later, so let’s write
a function first:
simulate = function(nsim, eps, conditional) {
x = runif(nsim)
y = runif(nsim)
z = runif(nsim)
d1 = sqrt((1 - x)^2 + (1 - y)^2 + (1 - z)^2)
if (conditional) {
d0 = sqrt(x^2 + y^2 + z^2)
cond = (d0 > 1.5 - eps) & (d0 < 1.5 + eps)
return(d1[cond])
} else {
return(d1)
}
}
Histograms may be used as approximations to the pdf of the (conditional) distribution:
nsim = 1000000
eps = 0.1
d1_cond = simulate(nsim, eps, conditional = TRUE)
d1 = simulate(nsim, eps, conditional = FALSE)
par(mfrow = c(1, 2))
hist(d1, breaks = 50)
hist(d1_cond, breaks = 50)
Note that, in addition to clearly different shapes, the supports of the two distributions differ, too. Unconditionally, the distance to \((1,1,1)\) can be any number from \(0\) to \(\sqrt{3} \approx 1.73\). If it is known that the distance to \((0,0,0)\) is \(1.5\), however, the distance to \((1,1,1)\) cannot be larger than \(1\).
Finally, let us compare the results we obtain by varying the
parameter \(\varepsilon\), first with
nsim=100000:
nsim = 100000
epss = c(2, 1, 0.5, 0.3, 0.2, 0.1, 0.02, 0.01, 0.001, 0.0001)
d1s = vector(length = length(epss))
for (eps in epss) {
sims = simulate(nsim, eps = eps, conditional = TRUE)
print(paste("Eps = ", eps, ", Draws = ", length(sims), " Mean = ", mean(sims)))
}
## [1] "Eps = 2 , Draws = 100000 Mean = 0.960342228444094"
## [1] "Eps = 1 , Draws = 93544 Mean = 0.929726634933677"
## [1] "Eps = 0.5 , Draws = 47806 Mean = 0.753283074403369"
## [1] "Eps = 0.3 , Draws = 20224 Mean = 0.595529938750379"
## [1] "Eps = 0.2 , Draws = 10658 Mean = 0.496362640888308"
## [1] "Eps = 0.1 , Draws = 3930 Mean = 0.36209037346251"
## [1] "Eps = 0.02 , Draws = 647 Mean = 0.310139851430518"
## [1] "Eps = 0.01 , Draws = 354 Mean = 0.30926108659397"
## [1] "Eps = 0.001 , Draws = 33 Mean = 0.334224126737702"
## [1] "Eps = 0.0001 , Draws = 4 Mean = 0.275311021507976"
The same experiment, but with nsim=1000000 yields:
nsim = 1000000
epss = c(2, 1, 0.5, 0.3, 0.2, 0.1, 0.02, 0.01, 0.001, 0.0001)
d1s = vector(length = length(epss))
for (eps in epss) {
sims = simulate(nsim, eps = eps, conditional = TRUE)
print(paste("Eps = ", eps, ", Draws = ", length(sims), " Mean = ", mean(sims)))
}
## [1] "Eps = 2 , Draws = 1000000 Mean = 0.960506260771782"
## [1] "Eps = 1 , Draws = 934686 Mean = 0.928432784386909"
## [1] "Eps = 0.5 , Draws = 477123 Mean = 0.753640516540863"
## [1] "Eps = 0.3 , Draws = 201634 Mean = 0.596081456550207"
## [1] "Eps = 0.2 , Draws = 103279 Mean = 0.494799342061533"
## [1] "Eps = 0.1 , Draws = 38529 Mean = 0.361335849359528"
## [1] "Eps = 0.02 , Draws = 6944 Mean = 0.3064588479179"
## [1] "Eps = 0.01 , Draws = 3371 Mean = 0.30560770769072"
## [1] "Eps = 0.001 , Draws = 331 Mean = 0.301734543251831"
## [1] "Eps = 0.0001 , Draws = 33 Mean = 0.31500400469211"
Simulate \(10000\) draws from a uniform distribution inside the cube \([-1,1]\times[-1,1] \times [-1,1]\). Go through your simulations, and discard the ones that do not lie within the unit ball (the ball centered around \((0,0,0)\), with radius \(1\).) Now you have a bunch of uniform simulations from the unit ball (Do not display them).
Use your simulations to estimate the mean and the standard deviation of \(W\), where \(W\) is the distance from the the origin to a randomly and uniformly chosen point in the unit ball.
A point was chosen randomly and uniformly in the unit ball, and it has been observed that it is closer to the point \((1,-1,1)\) than to \((1,-1,-1)\). Estimate (graphically) the shape of the conditional pdf of its distance to the origin, given this observation.
(extra credit) Do 3., but with the point chosen uniformly over the surface of the ball and for the distance to the point \((1,1,1)\) and not to the origin (which is always \(1\) in this case). Even more extra points if you do not draw any additional simulations.
We draw the \(x\), \(y\) and \(z\) coordinates independently of each other, and uniformly in \([-1,1]\). Then we discard those simulations where the sum of the squares is \(>1\):
nsim = 10000
x_cube = runif(nsim, min = -1, max = 1)
y_cube = runif(nsim, min = -1, max = 1)
z_cube = runif(nsim, min = -1, max = 1)
in_ball = x_cube^2 + y_cube^2 + z_cube^2 <= 1
x = x_cube[in_ball]
y = y_cube[in_ball]
z = z_cube[in_ball]

By Monte Carlo:
W = sqrt(x^2 + y^2 + z^2)
mean(W)
## [1] 0.75
sd(W)
## [1] 0.19
Note that the mean is not half the radius, i.e. \(0.5\). Why?
We add two new variables, \(d1\) and \(d2\) - distances to \((1,-1,1)\) and \((1,-1,-1)\), respectively:
d1 = sqrt((x - 1)^2 + (y + 1)^2 + (z - 1)^2)
d2 = sqrt((x - 1)^2 + (y + 1)^2 + (z + 1)^2)
condition = d1 < d2
W_cond = W[condition]
hist(W_cond, breaks = 50, probability = T)
The main difficulty in this part of the problem is getting simulations from the uniform distribution on the sphere. One possibility is to use the rejection method, and proceed as in part 1., but keep only those points for which \[1-\varepsilon < \sqrt{x^2+y^2+z^2} < 1+\varepsilon\] for some (small) \(\varepsilon\). There is a more efficient method, though, and it does not involve any new simulations. We simply take our uniformly distributed points inside the unit ball and project each one onto the sphere, i.e., replace it by the closest point on the surface. This is easily achieved since the projection of the point \((x,y,z)\) is \[\Big( \frac{x}{\sqrt{x^2+y^2+z^2}}, \frac{y}{\sqrt{x^2+y^2+z^2}}, \frac{z}{\sqrt{x^2+y^2+z^2}} \Big).\]
A moment’s thought will convince you that those points indeed cover the sphere uniformly (no direction is preferred!).
Since we already have the distance to the origin stored in the
variable W, the code will be extremely simple:
x_s = x/W
y_s = y/W
z_s = z/W
The rest now parallels what happened in 2.
d1 = sqrt((x_s - 1)^2 + (y_s + 1)^2 + (z_s - 1)^2)
d2 = sqrt((x_s - 1)^2 + (y_s + 1)^2 + (z_s + 1)^2)
condition = d1 < d2
W1 = sqrt((x_s - 1)^2 + (y_s - 1)^2 + (z_s - 1)^2)
W1_cond = W1[condition]
hist(W1_cond, breaks = 50, probability = T)
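As an aside, there is yet another standard way to obtain uniform draws from the surface of the unit sphere, useful when no points in the ball are available: normalize standard normal vectors. It works because the joint density of three independent standard normals depends only on the distance from the origin, so, again, no direction is preferred:

```r
nsim = 10000
# three independent standard normal coordinates
x = rnorm(nsim); y = rnorm(nsim); z = rnorm(nsim)
r = sqrt(x^2 + y^2 + z^2)
# projecting onto the sphere yields uniformly distributed points
x_s = x/r; y_s = y/r; z_s = z/r
```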
A stochastic process is a sequence - finite or infinite - of random variables. We usually write \(\{X_n\}_{n\in{\mathbb{N}}_0}\) or \(\{X_n\}_{0\leq n \leq T}\), depending on whether we are talking about an infinite or a finite sequence. The number \(T\in {\mathbb{N}}_0\) is called the time horizon, and we sometimes set \(T=+\infty\) when the sequence is infinite. The index \(n\) is often interpreted as time, so that a stochastic process can be thought of as a model of a random process evolving in time. The initial value of the index \(n\) is often normalized to \(0\), even though other values may be used. This is usually clear from the context.
It is important that all the random variables \(X_0, X_1,\dots\) “live” on the same sample space \(\Omega\). This way, we can talk about the notion of a trajectory or a sample path of a stochastic process: it is, simply, the sequence of numbers \[X_0(\omega), X_1(\omega), \dots\] but with \(\omega\in \Omega\) considered “fixed”. In other words, we can think of a stochastic process as a random variable whose value is not a number, but a sequence of numbers. This will become much clearer once we introduce enough examples.
A stochastic process \(\{X_n\}_{n\in{\mathbb{N}}_0}\) is said to be a simple symmetric random walk (SSRW) if
\(X_0=0\),
the random variables \(\delta_1 = X_1-X_0\), \(\delta_2 = X_2 - X_1\), …, called the steps of the random walk, are independent
each \(\delta_n\) has a coin-toss distribution, i.e., its distribution is given by \[{\mathbb{P}}[ \delta_n = 1] = {\mathbb{P}}[ \delta_n=-1] = \tfrac{1}{2} \text{ for each } n.\]
Some comments:
This definition captures the main features of an idealized notion of a particle that gets shoved, randomly, in one of two possible directions, over and over. In other words, these “shoves” force the particle to take a step, and steps are modeled by the random variables \(\delta_1,\delta_2, \dots\). The position of the particle after \(n\) steps is \(X_n\); indeed, \[X_n = \delta_1 + \delta_2 + \dots + \delta_n \text{ for }n\in {\mathbb{N}}.\] It is important to assume that any two steps are independent of each other - the most important properties of random walks depend on this in a critical way.
Sometimes, we only need a finite number of steps of a random walk, so we only care about the random variables \(X_0, X_1,\dots, X_T\). This stochastic process (now with a finite time horizon \(T\)) will also be called a random walk. If we want to stress that the horizon is not infinite, we sometimes call it the finite-horizon random walk. Whether \(T\) is finite or infinite is usually clear from the context.
The starting point \(X_0=0\) is just a normalization. Sometimes we need more flexibility and allow our process to start at \(X_0=x\) for some \(x\in {\mathbb{N}}\). To stress that fact, we talk about the random walk starting at \(x\). If no starting point is mentioned, you should assume \(X_0=0\).
We will talk about biased (or asymmetric) random walks a bit later. The only difference will be that the probabilities of each \(\delta_n\) taking values \(1\) or \(-1\) will be \(p\in (0,1)\) and \(1-p\), and not necessarily \(\tfrac{1}{2}\). The probability \(p\) cannot change from step to step, and the steps \(\delta_1, \delta_2, \dots\) will continue to be independent of each other.
The word simple in its name refers to the fact that the distribution of every step is a coin toss. You can easily imagine a more complicated mechanism that would govern each step. For example, not only the direction, but also the size of the step could be random. In fact, any distribution you can think of can be used as the step distribution of a random walk. Unfortunately, we will have very little to say about such general random walks in these notes.
In addition to being quite simple conceptually, random walks are also easy to simulate. The fact that the steps \(\delta_n = X_n - X_{n-1}\) are independent coin tosses immediately suggests a feasible strategy: simulate \(T\) independent coin tosses first, and then define each \(X_n\) as the sum of the first \(n\) tosses.
Before we implement this idea in R, let us agree on a few conventions which we will use whenever we simulate a stochastic process:
each simulation occupies one row of a data.frame object;
the columns of the data frame are named X0, X1, X2, etc.

This is best achieved by the following two-stage approach in R:
write a function which will simulate a single trajectory of your process. If your process comes with parameters, it is a good idea to include them as arguments to this function.
use the function replicate to stack together many
such simulations and convert the result to a data.frame.
Don’t forget to transpose after (use the function t)
because replicate works column by column, and not row by
row.
Let’s implement this in the case of a simple random walk. Of course, it is impossible to simulate a random walk on an infinite horizon (\(T=\infty\)) so we must restrict to finite-horizon random walks. The function cumsum, which produces partial sums of its input, comes in very handy.
single_trajectory = function(T, p = 0.5) {
delta = sample(c(-1, 1), size = T, replace = TRUE, prob = c(1 - p, p))
x = cumsum(delta)
return(x)
}
Next, we run the same function nsim times and record the
results. It is a lucky break that the default names given to columns are
X1, X2, … so we don’t have to rename them. We
do have to add the zero-th column \(X_0=0\) because, formally speaking, the
“random variable” \(X_0=0\) is a part
of the stochastic process. This needs to be done before other columns
are added to maintain the proper order of columns, which is important
when you want to plot trajectories.
simulate_walk = function(nsim, T, p = 0.5) {
return(
data.frame(
X0 = 0,
t(replicate(nsim, single_trajectory(T, p)))
))
}
walk = simulate_walk(nsim = 10000, T = 500)
Now that we have the data frame walk, we can explore it in at least two qualitatively different ways:
Here we focus on individual random variables (columns) or pairs, triplets, etc. of random variables and study their (joint) distributions. For example, we can plot histograms of the random variables \(X_5, X_8, X_{30}\) or \(X_{500}\):
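Assuming the data frame walk produced by simulate_walk above, such histograms can be obtained, for example, like this:

```r
# histograms of the positions X_5, X_8, X_30 and X_500, column by column
par(mfrow = c(2, 2))
for (column in c("X5", "X8", "X30", "X500")) {
    hist(walk[[column]], main = column, xlab = "position")
}
```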
We can also use various (graphical or not) devices to understand joint distributions of pairs of random variables:
If we focus on what is going on in a given row of walk, we are going to see a different cross-section of our stochastic process. This way, we are fixing the state of the world \(\omega\) (represented by a row of walk), i.e., the particular realization of our process, while varying the time parameter. A typical picture associated to a trajectory of a random walk is the following:
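Such a picture can be produced from a single row of walk, for example (assuming the data frame walk simulated above):

```r
# the first simulated trajectory, as a plain numeric vector
trajectory = as.numeric(walk[1, ])
plot(0:(length(trajectory) - 1), trajectory,
     type = "l", xlab = "n", ylab = "position",
     main = "A trajectory of a simple symmetric random walk")
```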
You can try to combine the two approaches (if you must) and plot several trajectories on the same plot. While this produces pretty pictures (and has one or two genuine applications), it usually leads to a sensory overload. Note that the trajectories on the right are jittered a bit. That means that the positions of the points are randomly shifted by a small amount. This allows us to see features of the plot that would otherwise be hidden because of the overlap.
The row-wise (or path-wise or trajectory-wise) view of the random walk described above illustrates a very important point: the random walk (and random processes in general) can be seen as a random “variable” whose values are not merely numbers; they are sequences of numbers (trajectories). In other words, a random process is simply a “random trajectory”. We can simulate this random trajectory as we did above, by simulating the steps and adding them up, but we could also take a different approach: we could build the set of all possible trajectories, and then pick a random trajectory out of it.
For a random walk on a finite horizon \(T\), a trajectory is simply a sequence of integers starting from \(0\). Different realizations of the coin-tosses \(\delta_n\) will lead to different trajectories, but not every sequence of integers corresponds to a trajectory. For example \((0,3,4,5)\) is not possible because the increments of the random walk can only take values \(1\) or \(-1\). In fact, a finite sequence \((x_0, x_1, \dots, x_T)\) is a (possible) sample path of a random walk if and only if \(x_0=0\) and \(x_{k}-x_{k-1} \in \{-1,1\}\) for each \(k\). For example, when \(T=3\), there are \(8\) possible trajectories: \[ \begin{align} \Omega = \{ &(0,1,2,3), (0,1,2,1),(0,1,0,1), (0,1,0,-1), \\ & (0,-1,-2,-3), (0,-1,-2,-1), (0,-1,0,-1), (0,-1,0,1)\} \end{align}\] When you (mentally) picture them, think of their graphs:
Each trajectory corresponds to a particular combination of the values of the increments \((\delta_1,\dots, \delta_T)\), and each such combination happens with probability \(2^{-T}\). This means that any two trajectories are equally likely, which is convenient because it puts the uniform probability on the collection of trajectories. We are now ready
to implement our simulation procedure in R; let us write the function
single_trajectory using this approach and use it to
simulate a few trajectories. We assume that a function
all_paths(T) which returns a list of all possible paths
with horizon \(T\) has already been
implemented (more info about a possible implementation in R is given in
a problem below):
T=5
Omega = all_paths(T)
single_trajectory = function() {
return(unlist(sample(Omega, size = 1)))
}
simulate_walk = function(nsim, p=0.5) {
return(data.frame(
X0=0,
t(replicate(nsim, single_trajectory()))
))
}
Building a path space is not simply an exercise in abstraction. Here is how we can use it to understand the distribution of the position of the random walk:
Let \(X\) be a simple symmetric random walk with time horizon \(T=5\). What is the probability that \(X_{5}=1\)?
Let \(\Omega\) be the path space, i.e., the set of all possible trajectories of length \(5\) - there are \(2^{5}=32\) of them. The probability that \(X_{5}=1\) is the probability that a randomly picked path from \(\Omega\) will take the value \(1\) at \(n=5\). Since all paths are equally likely, we need to count the number of paths with value \(1\) at \(n=5\) and then divide by the total number of paths, i.e., \(32\).
So, how many paths are there that take value \(1\) at \(n=5\)? Each path is built out of steps of absolute value \(1\). Some of them go up (call them up-steps) and some of them go down (down-steps). A moment’s thought reveals that the only way to reach \(1\) in \(5\) steps is if you have exactly \(3\) up-steps and \(2\) down-steps. Conversely, any path that has \(3\) up-steps and \(2\) down-steps ends at \(1\).
This realization transforms the problem into the following: how many paths are there with exactly \(3\) up-steps? (Note that we don’t have to specify that there are \(2\) down-steps - that happens automatically.) The only difference between different paths with exactly \(3\) up-steps is the position of these up-steps. In some of them the up-steps happen right at the start, in some at the very end, and in some they are scattered around. Each path with \(3\) up-steps is uniquely determined by the list of positions of those up-steps, i.e., by a size-\(3\) subset of \(\{1,2,3,4,5\}\). This is not a surprise at all, since each path is built out of increments, and the positions of the positive increments clearly determine the values of all increments.
The problem has now become purely mathematical: how many size-\(3\) subsets of \(\{1,2,3,4,5\}\) are there? The answer comes in the form of a binomial coefficient \(\binom{5}{3}\) whose value is \(10\) - there are exactly ten ways to pick three positions out of five. Therefore, \[ {\mathbb{P}}[ X_{5} = 1] = 10 \times 2^{-5} = \frac{5}{16}.\]
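This answer is easy to confirm by brute force in R - enumerate all \(2^5=32\) combinations of increments and count those that sum to \(1\):

```r
# all 2^5 = 32 possible combinations of the five increments
increments = expand.grid(rep(list(c(-1, 1)), 5))
# the proportion of combinations whose sum (the final position) equals 1
mean(rowSums(increments) == 1)   # 0.3125, i.e., 5/16
```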
Can we do this in general?
Let \(X\) be a simple symmetric random walk with time horizon \(T\). What is the probability that \(X_{n}=k\)?
The reasoning from the last example still applies. A trajectory with \(u\) up-steps and \(d\) down-steps will end at \(u-d\), so we must have \(u-d=k\). On the other hand, \(u+d=n\) since all steps that are not up-steps are necessarily down-steps. This gives us a simple linear system with two equations and two unknowns, which solves to \(u = (n+k)/2\), \(d=(n-k)/2\). Note that \(n\) and \(k\) must have the same parity for this solution to be meaningful. Also, \(k\) must be between \(-n\) and \(n\).
Having figured out how many up-steps are necessary to reach \(k\), all we need to do is count the number of trajectories with that many up-steps. Like before, we can do that by counting the number of ways we can choose their positions among \(n\) steps, and, like before, the answer is the binomial coefficient \(\binom{n}{u}\) where \(u=(n+k)/2\). Dividing by the total number of trajectories gives us the final answer: \[ {\mathbb{P}}[ X_n = k ] = \binom{n}{ (n+k)/2} 2^{-n},\] for all \(k\) between \(-n\) and \(n\) with the same parity as \(n\). For all other \(k\), the probability is \(0\).
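The formula is straightforward to turn into an R function (the name walk_pmf is mine):

```r
# P[X_n = k] for a simple symmetric random walk
walk_pmf = function(n, k) {
    # zero for values outside [-n, n] or with the wrong parity
    if (abs(k) > n || (n + k)%%2 != 0) return(0)
    choose(n, (n + k)/2) * 2^(-n)
}
walk_pmf(5, 1)   # 0.3125, i.e., 5/16, matching the example above
```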
The binomial coefficient and the \(n\)-th power suggest that the distribution of \(X_n\) might have something to do with the binomial distribution. It is clearly not the binomial, since it can take negative values, but it is related. To figure out what is going on, let us first remember what the binomial distribution is all about. Formally, it is a discrete distribution with two parameters \(n\in{\mathbb{N}}\) and \(p\in (0,1)\). Its support is \(\{0,1,2,\dots, n\}\) and the distribution is given by the following table, where \(q=1-p\)
| 0 | 1 | 2 | … | k | … | n |
|---|---|---|---|---|---|---|
| \(\binom{n}{0} q^n\) | \(\binom{n}{1} p q^{n-1}\) | \(\binom{n}{2} p^2 q^{n-2}\) | … | \(\binom{n}{k} p^k q^{n-k}\) | … | \(\binom{n}{n} p^n\) |
The binomial distribution is best understood, however, when it is expressed as a “number of successes”. More precisely,
If \(B_1,B_2,\dots, B_n\) are \(n\) independent Bernoulli random variables with the same parameter \(p\), then their sum \(B_1+\dots+B_n\) has a binomial distribution with parameters \(n\) and \(p\).
We think of \(B_1, \dots, B_n\) as indicator random variables of “successes” in \(n\) independent “experiments” each of which “succeeds” with probability \(p\). A canonical example is tossing a biased coin \(n\) times and counting the number of “heads”.
We know that the position \(X_n\) at time \(n\) of the random walk admits the representation \[ X_n = \delta_1+\delta_2+\dots+\delta_n,\] just like the binomial random variable. The distribution of \(\delta_k\) is not Bernoulli, though, since it takes the values \(-1\) and \(1\), and not \(0,1\). This is easily fixed by applying the linear transformation \(x\mapsto \frac{1}{2}(x+1)\); indeed \(( -1 +1)/2 = 0\) and \(( 1 + 1) / 2 =1\), and, so, \[ \frac{1}{2}(\delta_k+1)\text{ is a Bernoulli random variable with parameter } p=\frac{1}{2}.\] Consequently, if we add all \(B_k = \tfrac{1}{2}(1+\delta_k)\) and remember our discussion from above we get the following statement
In a simple symmetric random walk the random variable \(\frac{1}{2} (n + X_n)\) has the binomial distribution with parameters \(n\) and \(p=1/2\), for each \(n\).
Can you use that fact to rederive the distribution of \(X_n\)?
If the steps of the random walk preferred one direction to the other, the definition would need to be tweaked a little bit, and the word “symmetric” in the name would be replaced by “biased” (or “asymmetric”):
A stochastic process \(\{X_n\}_{n\in{\mathbb{N}}_0}\) is said to be a **simple biased random walk with parameter \(p\in (0,1)\)** if
\(X_0=0\),
the random variables \(\delta_1 = X_1-X_0\), \(\delta_2 = X_2 - X_1\), …, called the steps of the random walk, are independent and
each \(\delta_n\) has a biased coin-toss distribution, i.e., its distribution is given by \[{\mathbb{P}}[ \delta_n = 1] = p \text{ and } {\mathbb{P}}[ \delta_n=-1] = 1-p \text{ for each } n.\]
As far as the distribution of \(X_n\) is concerned, we don’t expect it to be the same as in the symmetric case. After all, the biased random walk (think \(p=0.999\)) will prefer one direction over the other. Our trick with writing \(\frac{1}{2}(n+X_n)\) as a sum of Bernoulli random variables still works. We just have to remember that \(p\) is not \(\frac{1}{2}\) anymore to conclude that \(\tfrac{1}{2}(X_n + n)\) has the binomial distribution with parameters \(n\) and \(p\); if we put \(u = (n+k)/2\) we get \[\begin{align} {\mathbb{P}}[ X_n = k] &= {\mathbb{P}}[ \tfrac{1}{2}(X_n+n) = u] = \binom{n}{u} p^u q^{n-u}\\ & = \binom{n}{\frac{1}{2}(n+k)} p^{\frac{1}{2}(n+k)} q^{\frac{1}{2}(n-k)}. \end{align}\] Note that the binomial coefficient stays the same as in the symmetric case, but the factor \(2^{-n} = (1/2)^{\frac{1}{2}(n+k)} (1/2)^{\frac{1}{2}(n-k)}\) becomes \(p^{\frac{1}{2}(n+k)} q^{\frac{1}{2}(n-k)}\).
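Because \(\tfrac{1}{2}(n+X_n)\) is binomial, R’s built-in dbinom computes these probabilities directly; here is a sketch (the name biased_walk_pmf is ours):

```r
# P[X_n = k] for a simple biased random walk, via the binomial representation
biased_walk_pmf = function(n, k, p) {
  if (abs(k) > n || (n + k) %% 2 != 0) return(0)
  dbinom((n + k) / 2, size = n, prob = p)  # u = (n + k)/2 "successes"
}

biased_walk_pmf(5, 1, p = 0.5)  # 5/16 - agrees with the symmetric case
biased_walk_pmf(5, 1, p = 0.7)  # choose(5, 3) * 0.7^3 * 0.3^2
```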
Can we reuse the sample space \(\Omega\) to build a biased random walk? Yes, we can, but we need to assign possibly different probabilities to individual outcomes. Indeed, if \(p=0.99\), the probability that all the increments \(\delta\) of a \(10\)-step random walk take the value \(+1\) is \((0.99)^{10} \approx 0.90\). This is much larger than the probability that all steps take the value \(-1\), which is \((0.01)^{10}= 10^{-20}\).
In general, the probability that a particular path is picked out of \(\Omega\) will depend on the number of up-steps and down-steps; more precisely, it equals \(p^u q^{n-u}\), where \(u\) is the number of up-steps. The interesting thing is that the number of up-steps \(u\) depends only on the final position \(x_n\) of the path; indeed, \(u = \frac{1}{2}(n+x_n)\). This way, all paths of length \(T=5\) that end up at \(1\) get the same probability of being chosen, namely \(p^3 q^2\). Let us use the awful seizure-inducing graph with multiple paths for good, and adjust each path according to its probability; some jitter has been added to deal with overlap. The lighter-colored paths are less likely to happen than the darker-colored paths.
Let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be a simple symmetric random walk. Which of the following processes are simple random walks?
\(\{2 X_n\}_{n\in {\mathbb{N}}_0}\) ?
\(\{X^2_n\}_{n\in {\mathbb{N}}_0}\) ?
\(\{-X_n\}_{n\in {\mathbb{N}}_0}\) ?
\(\{ Y_n\}_{n\in {\mathbb{N}}_0}\), where \(Y_n = X_{5+n}-X_5\) ?
How about the case \(p\ne \tfrac{1}{2}\)?
No - the support of the distribution of \(X_1\) is \(\{-2,2\}\) and not \(\{-1,1\}\).
No - \(X_1^2=1\), and not \(\pm 1\) with equal probabilities.
Yes - all parts of the definition check out.
Yes - all parts of the definition check out.
The answers are the same if \(p\ne \tfrac{1}{2}\), but, in 3., \(-X_n\) comes with probability \(1-p\) of an up-step, and not \(p\).
Let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be a simple random walk.
Find the distribution of the product \(X_1 X_2\)
Compute \({\mathbb{P}}[ |X_1 X_2 X_3| = 2]\)
Find the probability that \(X\) will hit neither the level \(2\) nor the level \(-2\) until (and including) time \(T=3\)
Find the independent pairs of random variables among the following choices:
Since \(X_1 X_2 = X_1 (X_1 + \delta_2) = 1 + X_1 \delta_2\), and \(X_1 \delta_2\) takes each of the values \(-1\) and \(1\) with probability \(1/2\), the distribution of \(X_1 X_2\) is given by the following table:

| 0 | 2 |
|---|---|
| 0.5 | 0.5 |
\(|X_1 X_2 X_3|=2\) only in the following two cases \[ X_1=1, X_2=2, X_3=1 \text{ or } X_1=-1, X_2=-2, X_3=-1.\] Each of those paths has probability \(1/8\) of happening, so \({\mathbb{P}}[ |X_1 X_2 X_3| = 2] = 1/4\).
The only chance for \(X\) to hit \(2\) or \(-2\) before or at \(T = 3\) is at time \(n = 2\). Since \(X_2 \in \{ -2, 0, 2\}\), this happens with probability \({\mathbb{P}}[ X_2 \in \{-2,2\}] = 1 - {\mathbb{P}}[X_2 = 0] = 0.5\).
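Both answers are easy to double-check by brute force in R, since there are only \(2^3=8\) equally likely three-step paths:

```r
# enumerate all 8 increment triples (delta_1, delta_2, delta_3)
deltas = expand.grid(d1 = c(-1, 1), d2 = c(-1, 1), d3 = c(-1, 1))
X1 = deltas$d1
X2 = X1 + deltas$d2
X3 = X2 + deltas$d3

mean(abs(X1 * X2 * X3) == 2)  # P[|X_1 X_2 X_3| = 2] = 0.25
mean(X2 %in% c(-2, 2))        # P[X hits 2 or -2 by time 3] = 0.5
```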
The only independent pair is \(X_4 - X_2\) and \(X_6 - X_5\), because the two random variables are built out of completely different increments: \(X_4 - X_2 = \delta_3+\delta_4\) while \(X_6-X_5 = \delta_6\). The others are not independent. For example, if we are told that \(X_1+X_3 = 4\), it necessarily follows that \(\delta_1= \delta_2=\delta_3=1\). Hence, \(X_2+X_4 = 2\delta_1+2\delta_2+\delta_3+\delta_4 = 5+\delta_4\), which cannot be less than \(4\). On the other hand, without any information, \(X_2+X_4\) can easily be negative.
Let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be a simple random walk.
Compute \({\mathbb{P}}[ X_{32} = 4| X_8 = 6]\).
Compute \({\mathbb{P}}[ X_9 = 3 \text{ and } X_{15}=5 ]\)
(extra credit) Compute \({\mathbb{P}}[ X_7 + X_{12} = X_1 + X_{16}]\)
This is the same as \({\mathbb{P}}[ X_{32}- X_8 = -2 | X_8=6]\). The random variables \(X_8\) and \(X_{32}-X_8\) are independent (as they are built out of different \(\delta\)s), so we can remove the conditioning. It remains to compute \({\mathbb{P}}[X_{32} - X_8 = -2]\). For that, we note that \(X_{32} - X_8\) is a sum of \(24\) independent coin tosses, so its distribution is the same as that of \(X_{24}\). Therefore, by our formula for the distribution of \(X_n\), we have \[ {\mathbb{P}}[X_{32}= 4 | X_8 = 6] = {\mathbb{P}}[X_{24} = -2] = \binom{24}{11} 2^{-24}.\]
We have \[\begin{align} {\mathbb{P}}[ X_9 = 3 \text{ and } X_{15}=5 ] & = {\mathbb{P}}[ X_{15} = 5 | X_9 = 3] \times {\mathbb{P}}[ X_9=3] \\ & = {\mathbb{P}}[ X_6 = 2] \times {\mathbb{P}}[X_9=3] = \binom{6}{4} 2^{-6} \binom{9}{6} 2^{-9}, \end{align}\] where we used the same ideas as in 1. above.
We rewrite everything using \(\delta\)s: \[\begin{align} X_7+X_{12} = X_1+X_{16} &\Leftrightarrow X_7-X_1 = X_{16}-X_{12} \Leftrightarrow \delta_2+\dots+\delta_7 = \delta_{13} + \dots+\delta_{16}\\ & \Leftrightarrow (-\delta_2) + \dots + (-\delta_7) + \delta_{13}+ \dots + \delta_{16} = 0. \end{align}\] Since \(-\delta_k\) has the same distribution as \(\delta_k\) (both are coin tosses) and remains independent of all other \(\delta_i\), the left-hand side of the last expression in the chain of equivalences above is a sum of \(10\) independent coin tosses. Therefore, the probability that it equals \(0\) is the same as \({\mathbb{P}}[X_{10}=0] = \binom{10}{5} 2^{-10}\).
Let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be a simple random walk. For \(n\in{\mathbb{N}}\) compute the probability that \(X_{2n}\), \(X_{4n}\) and \(X_{6n}\) take the same value.
Increments \(X_{4n}-X_{2n}\) and \(X_{6n} - X_{4n}\) are independent, and each is a sum of \(2n\) independent coin tosses (therefore has the same distribution as \(X_{2n}\)). Hence, \[\begin{aligned} {\mathbb{P}}[ X_{2n} = X_{4n} \text{ and }X_{4n} = X_{6n} ] &= {\mathbb{P}}[ X_{4n} - X_{2n} = 0 \text{ and }X_{6n} - X_{4n} = 0 ]\\ &= {\mathbb{P}}[ X_{4n} - X_{2n} = 0] \times {\mathbb{P}}[ X_{6n} - X_{4n}=0]\\ &={\mathbb{P}}[ X_{2n}=0] \times {\mathbb{P}}[ X_{2n} =0 ]\\ & = \binom{2n}{n} 2^{-2n} \binom{2n}{n} 2^{-2n} = \binom{2n}{n}^2 2^{-4n}. \end{aligned}\]
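For \(n=1\) the formula gives \(\binom{2}{1}^2 2^{-4} = 1/4\); a quick enumeration of all \(2^6\) six-step paths confirms it:

```r
# all 64 increment sequences of a 6-step walk, one per row
grid = expand.grid(rep(list(c(-1, 1)), 6))
paths = t(apply(grid, 1, cumsum))  # row i = trajectory (X_1, ..., X_6)

mean(paths[, 2] == paths[, 4] & paths[, 4] == paths[, 6])  # 0.25
```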
Write an R function (call it all_paths) which takes an
integer argument T and returns a list of all possible paths
of a random walk with time horizon \(T\). (Note: Since vectors cannot have other
vectors as elements, you will need to use a data structure called
list for this. It behaves very much like a vector, so it
should not be a problem.)
The implementation below uses the function combn which
returns the list of all subsets of a certain size of a certain vector.
Since each path is determined by the positions of its up-steps, we need
to loop through all numbers \(i\) from
\(0\) to \(T\) and then list all subsets of the size
\(i\). The next step is to turn a set of positions into a path of
a random walk. This can be done in many ways; one is implemented
in choice_to_path using vector indexing.
choice_to_path = function(comb, T) {
  # start with all down-steps, then flip the chosen positions to up-steps
  increments = rep(-1, T)
  increments[comb] = 1
  path = cumsum(increments)
  return(path)
}

all_paths = function(T) {
  Omega = vector("list", 2^T)  # preallocate a list with 2^T slots
  index = 1
  for (i in 0:T) {
    # all ways to choose the positions of i up-steps among T increments
    choices = combn(T, i, simplify = FALSE)
    for (choice in choices) {
      Omega[[index]] = choice_to_path(choice, T)
      index = index + 1
    }
  }
  return(Omega)
}
Let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be a simple symmetric random walk. Given \(n\in{\mathbb{N}}_0\) and \(k\in{\mathbb{N}}\), compute \(\operatorname{Var}[X_n]\), \(\operatorname{Cov}[X_n, X_{n+k}]\) and \(\operatorname{corr}[X_n, X_{n+k}]\), where \(\operatorname{Cov}\) stands for the covariance and \(\operatorname{corr}\) for the correlation. (Note: look up \(\operatorname{Cov}\) and \(\operatorname{corr}\) if you forgot what they are).
Compute \(\lim_{n\to\infty} \operatorname{corr}[X_n, X_{n+k}]\) and \(\lim_{k\to\infty} \operatorname{corr}[X_n, X_{n+k}]\). How would you interpret the results you obtained?
We have \(\operatorname{Var}[\delta_i] = 1\) for each \(i\in{\mathbb{N}}\), so \[\operatorname{Var}[X_n] = \sum_{i=1}^n \operatorname{Var}[\delta_i] = n.\] Since \({\mathbb{E}}[X_n] = {\mathbb{E}}[X_{n+k}]=0\) and \(X_{n+k} - X_n\) is independent of \(X_n\), we have \[\begin{aligned} \operatorname{Cov}[X_n,X_{n+k}] &= {\mathbb{E}}[ X_n X_{n+k}] = {\mathbb{E}}[ X_n (X_{n+k} - X_n)] + {\mathbb{E}}[X_n^2] = {\mathbb{E}}[X_n] {\mathbb{E}}[X_{n+k} - X_n] + {\mathbb{E}}[X_n^2]\\ &= {\mathbb{E}}[X_n^2] = n. \end{aligned}\] Finally, \[\begin{aligned} \operatorname{corr}[X_n, X_{n+k}] = \frac{\operatorname{Cov}[X_n, X_{n+k}]}{\sqrt{\operatorname{Var}[X_n]} \sqrt{\operatorname{Var}[X_{n+k}]}} = \frac{n}{\sqrt{n(n+k)}} = \sqrt{\frac{n}{n+k}}. \end{aligned}\] When we let \(n\to\infty\), we get \(1\). This means that the positions of the random walk, \(k\) steps apart, get closer and closer to perfect correlation as \(n\to\infty\). If you know \(X_n\) and \(n\) is large, you almost know \(X_{n+k}\), at least at the typical scale of \(X_n\).
When we let \(k\to\infty\), we get \(0\). That means that as the gap between two points in time gets larger, the values get less and less correlated. In a sense, the random walk tends to forget its past after a large number of steps.
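Since all trajectories of a fixed length are equally likely, the correlation formula can be verified exactly by enumeration; here is a sketch for \(n=4\), \(k=3\):

```r
# enumerate all 2^7 trajectories of a 7-step symmetric walk
grid = expand.grid(rep(list(c(-1, 1)), 7))
paths = t(apply(grid, 1, cumsum))
X4 = paths[, 4]
X7 = paths[, 7]

cor(X4, X7)   # equals sqrt(4/7) exactly, since both means are 0
sqrt(4 / 7)
```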
Let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be a simple random walk with \({\mathbb{P}}[X_1=1]=p\in (0,1)\), and let \(A_n\) be the (signed) area under its graph (in the picture below, \(A_n\) is the area of the blue part minus the area of the orange part).
Find a formula for \(A_n\) in terms of \(X_1,\dots, X_n\).
Compute \({\mathbb{E}}[A_n]\) and \(\operatorname{Var}[A_n]\), for \(n\in{\mathbb{N}}\). (You will find the following formulas helpful \(\sum_{j=1}^n j = \frac{n(n+1)}{2}\) and \(\sum_{j=1}^n j^2=\frac{n(n+1)(2n+1)}{6}\).)
Use simulations to approximate the entire distribution of \(A_n\) (set \(p=0.3\), \(n=10\), and run \(10000\) simulations): display the entire distribution table. Next, compute Monte-Carlo approximations to \({\mathbb{E}}[A_{10}]\) and \(\operatorname{Var}[A_{10}]\) and compare them to the exact values obtained in 2. above.
Let \(\{X_n\}_{0 \leq n \leq 100}\) be the simple symmetric random walk with time horizon \(T = 100\). We define the following random variables
Draw nsim simulations of \(L\), \(P\)
and \(R\) and display their histograms
on the same graph, side by side. Set the number of bins (option
breaks in the command hist) to \(50\). Choose the number nsim
so that the simulations take no more than 2 minutes, but do not go over
\(100,000\).
(Hint: For each of the three random variables write a
function which takes a single trajectory of a random walk as an argument
and returns its value for that trajectory. \(P\) is easy, and for \(L\) and \(R\) use the function match. Be
careful - match finds the first match, but you need the
last one. Then simulate the random walk and apply your
functions to the rows of your output data frame.)
What do you observe? What conjecture can you make about distributions of \(L\), \(P\) and \(R\)?
(Extra credit) Even though \(L\), \(P\) and \(R\) are discrete random variables, it turns out that distributions of their normalized versions \(L/T\), \(P/T\) and \(R/T\) are close to named continuous distributions. Guess what these distributions are and explain (graphically) why you think your guess is correct.
Counting trajectories in order to compute probabilities is a powerful method, as our next example shows. It also reveals a potential weakness of the combinatorial approach: it works best when all \(\omega\) are equally likely (e.g., when \(p=\tfrac{1}{2}\) in the case of the random walk).
We start by asking a simple question: what is the typical record value of the random walk, i.e., how far “up” (or “right”, depending on your point of view) does it typically get? Clearly, the largest value it can attain is \(T\). This happens only when all coin tosses come up \(+1\), an extremely unlikely event - its probability is \(2^{-T}\). On the other hand, the maximal value is at least \(0\), since \(X_0=0\) already. A bit of thought reveals that any value between those two extremes is possible, but it is not at all easy to compute their probabilities.
More precisely, let \(\{X_n\}\) be a simple random walk with time horizon \(T\). We define its running-maximum process \(\{M_n\}_{n\in {\mathbb{N}}_0}\) by \[M_n=\max(X_0,\dots, X_n),\ \text{ for }0 \leq n \leq T,\] and ask what the probabilities \({\mathbb{P}}[M_n = k]\) for \(k=0,\dots, n\) are. An easy numerical
solution to this problem can be given by simulation. We reuse the
function simulate_walk defined at the beginning of the
chapter, but also employ a new function, called apply which
“applies” a function to each row (or column) of a data frame or a
matrix. It seems to be tailor-made for our purpose, because we want to
compute the maximum of each row of the simulation matrix (remember - the
row means keep the realization fixed, but vary the time-index \(n\)). The syntax of apply is
simple - it needs the data frame, the margin (rows are coded as 1 and
columns as 2; so when the margin is 1, the function is applied row-wise
and when the margin is 2, the function is applied column-wise) and the
function to be applied (max in our case). The output is a
vector of size nsim with all row-wise maxima:
walk = simulate_walk(nsim = 100000, T = 12, p = 0.5)
M = apply(walk, 1, max)
hist(M, breaks = seq(-0.5, 12.5, 1), probability = TRUE)
The overall shape of the distribution is as we expected; the support is \(\{0,1,2,\dots, 12\}\) and the probabilities tend to decrease as \(k\) gets larger. The unexpected feature is that \({\mathbb{P}}[ M_{12} = 1]\) seems to be the same as \({\mathbb{P}}[ M_{12} = 2]\). It drops after that for \(k=3\), but it looks like \({\mathbb{P}}[ M_{12} = 3] = {\mathbb{P}}[ M_{12}=4]\) again. Somehow the probability does not seem to change at all from \(2i-1\) to \(2i\).
Fortunately, there is an explicit formula for the distribution of \(M_n\) and we can derive it by a nice counting trick known as the reflection principle.
As usual, we may assume without loss of generality that \(T=n\) since the values of \(\delta_{n+1}, \dots, \delta_T\) do not affect \(M_n\) at all. We start by picking a level \(l\in\{1,\dots, n\}\) and first compute the probability \({\mathbb{P}}[M_n\geq l]\) - it will turn out to be easier than attacking \({\mathbb{P}}[ M_n=l]\) directly. The symmetry assumption \(p=1/2\) ensures that all trajectories are equally likely, so we can do this by counting the number of trajectories whose maximal level reached is at least \(l\), and then multiply by \(2^{-n}\).
What makes the computation of \({\mathbb{P}}[M_n \geq l]\) a bit easier than that of \({\mathbb{P}}[ M_n = l]\) is the following equivalence
\[M_n\geq l \text{ if and only if } X_k=l \text{ for some } 0 \leq k \leq n.\]
In words, the set of trajectories whose maximum is at least \(l\) is exactly the same as the set of trajectories that hit the level \(l\) at some time. Let us denote the set of trajectories \(\omega\) with this property by \(A_l\), so that \({\mathbb{P}}[ M_n \geq l] = {\mathbb{P}}[A_l]\). We can further split \(A_l\) into three disjoint events \(A_l^{>}\), \(A_l^{=}\) and \(A_l^{<}\), depending on whether \(X_n>l\), \(X_n=l\) or \(X_n<l\). In the picture below, the red trajectory is in \(A_l^{>}\), the green trajectory in \(A_l^=\), the orange one in \(A_l^{<}\), while the blue one is not in \(A_l\) at all.
With the set of all trajectories \(\Omega\) partitioned into four disjoint classes, namely \(A^>_l, A^=_l, A^<_l\) and \((A_l)^c\), we are ready to reveal the main idea behind the reflection principle: the sets \(A_l^{>}\) and \(A_l^{<}\) contain exactly the same number of trajectories, i.e., \(\# A_l^{>} = \# A_l^{<}\).
To see why that is true, start by choosing a trajectory \(\omega\in A_l^{>}\) and denoting by \(\tau_l(\omega)\) the first time \(\omega\) visits the level \(l\). Since \(\omega \in A_l^{>}\), such a time clearly exists. Then we associate to \(\omega\) another trajectory, call it \(\bar{\omega}\), obtained from \(\omega\) in the following way:
Equivalently, the increments of \(\omega\) and \(\bar{\omega}\) are exactly the same up to time \(\tau_l(\omega)\), and exactly the opposite afterwards. In the picture below, the orange trajectory is \(\omega\) and the green trajectory is its “reflection” \(\bar{\omega}\); note that they overlap until time \(5\):
Convince yourself that this procedure establishes a bijection between the sets \(A_l^{>}\) and \(A_l^{<}\), making these two sets equal in size.
So why is it important to know that \(\#
A_l^> = \# A_l^<\)? Because the trajectories in \(A_l^>\) (as well as in \(A_l^=\)) are easy to count. For them, the
requirement that the level \(l\) is hit
at a certain point is redundant; if you are at or above \(l\) at the very end, you must have hit
\(l\) at a certain point.
Therefore, \(A_l^{>}\) is simply the
family of those trajectories \(\omega\)
whose final positions \(X_n(\omega)\)
are somewhere strictly above \(l\).
Hence, \[\begin{align}
{\mathbb{P}}[A_l^{>}] &= {\mathbb{P}}[ X_n=l+1 \text{ or }
X_n = l+2 \text{ or } \dots \text{ or }
X_n=n]\\ & = \sum_{k=l+1}^n {\mathbb{P}}[X_n = k]
\end{align}\]
Similarly, \[\begin{aligned} {\mathbb{P}}[ A_l^{=}] = {\mathbb{P}}[X_n=l].\end{aligned}\] Finally, by the reflection principle, \[\begin{aligned} {\mathbb{P}}[ A_l^{<}] = {\mathbb{P}}[A_l^{>}] = \sum_{k=l+1}^n {\mathbb{P}}[X_n=k].\end{aligned}\]
Putting all of this together, we get \[\begin{aligned} {\mathbb{P}}[ A_l ] = {\mathbb{P}}[ X_n=l] + 2 \sum_{k=l+1}^n {\mathbb{P}}[X_n=k],\end{aligned}\] so that \[\begin{aligned} {\mathbb{P}}[ M_n = l ] &= {\mathbb{P}}[ M_n \geq l] - {\mathbb{P}}[ M_n \geq l+1]\\ & = {\mathbb{P}} [A_l] - {\mathbb{P}} [A_{l+1}]\\ & = {\mathbb{P}}[ X_n = l] + 2 {\mathbb{P}}[X_n = l+1] + 2{\mathbb{P}}[X_n = l+2]+ \dots + 2{\mathbb{P}}[ X_n=n] -\\ & \qquad \qquad \quad \ - {\mathbb{P}}[ X_n = l+1] - 2 {\mathbb{P}}[X_n = l+2] - \dots - 2{\mathbb{P}}[ X_n=n]\\ &= {\mathbb{P}}[ X_n=l] + {\mathbb{P}}[X_n=l+1] \end{aligned}\]
Now that we have the explicit expression \[ {\mathbb{P}}[ M_n = l ] = {\mathbb{P}}[ X_n=l] + {\mathbb{P}}[X_n = l+1] \text{ for } l=0,1,\dots, n,\] we can shed some light on the shape of the histogram for \(M_n\) we plotted above. Since \({\mathbb{P}}[X_n=l]\) is \(0\) if \(n\) and \(l\) don’t have the same parity, it is clear that only one of the probabilities \({\mathbb{P}}[X_n=l]\) and \({\mathbb{P}}[X_n=l+1]\) can be positive. It follows that, for \(n\) even, we have \[\begin{align} {\mathbb{P}}[ M_n =0] &= {\mathbb{P}}[X_n=0] + {\mathbb{P}}[X_n=1] = {\mathbb{P}}[X_n=0]\\ {\mathbb{P}}[M_n=1] &= {\mathbb{P}}[ X_n=1] + {\mathbb{P}}[X_n=2] = {\mathbb{P}}[X_n=2]\\ {\mathbb{P}}[M_n=2] &= {\mathbb{P}}[ X_n=2] + {\mathbb{P}}[X_n=3] = {\mathbb{P}}[X_n=2]\\ {\mathbb{P}}[M_n=3] &= {\mathbb{P}}[ X_n=3] + {\mathbb{P}}[X_n=4] = {\mathbb{P}}[X_n=4]\\ {\mathbb{P}}[M_n=4] &= {\mathbb{P}}[ X_n=4] + {\mathbb{P}}[X_n=5] = {\mathbb{P}}[X_n=4] \text{ etc.} \end{align}\] In a similar way, for \(n\) odd, we have \[\begin{align} {\mathbb{P}}[ M_n =0] &= {\mathbb{P}}[X_n=0] + {\mathbb{P}}[X_n=1] = {\mathbb{P}}[X_n=1]\\ {\mathbb{P}}[M_n=1] &= {\mathbb{P}}[ X_n=1] + {\mathbb{P}}[X_n=2] = {\mathbb{P}}[X_n=1]\\ {\mathbb{P}}[M_n=2] &= {\mathbb{P}}[ X_n=2] + {\mathbb{P}}[X_n=3] = {\mathbb{P}}[X_n=3]\\ {\mathbb{P}}[M_n=3] &= {\mathbb{P}}[ X_n=3] + {\mathbb{P}}[X_n=4] = {\mathbb{P}}[X_n=3]\\ {\mathbb{P}}[M_n=4] &= {\mathbb{P}}[ X_n=4] + {\mathbb{P}}[X_n=5] = {\mathbb{P}}[X_n=5] \text{ etc.} \end{align}\]
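In R, the explicit formula is a short function (the name max_pmf is ours), and it reproduces the equal consecutive pairs seen in the histogram:

```r
# P[M_n = l] = P[X_n = l] + P[X_n = l + 1] for the symmetric walk
max_pmf = function(n, l) {
  pX = function(k) {  # P[X_n = k]; zero for wrong parity or out of range
    if (abs(k) > n || (n + k) %% 2 != 0) return(0)
    choose(n, (n + k) / 2) * 2^(-n)
  }
  pX(l) + pX(l + 1)
}

sapply(0:12, function(l) max_pmf(12, l))  # note the repeated values
max_pmf(12, 1) == max_pmf(12, 2)          # TRUE
```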
Here is an example of a typical problem where the reflection principle (i.e., the formula for \({\mathbb{P}}[M_n=k]\)) is used:
Let \(X\) be a simple symmetric random walk. What is the probability that \(X_n\leq 0\) for all \(0\leq n \leq T\)?
This is really a question about the maximum, but in disguise. The walk will stay negative or \(0\) if and only if its running maximum \(M_T\) at time \(T\) takes the value \(0\). By our formula for \({\mathbb{P}}[M_n=l]\) we have \[ {\mathbb{P}}[M_T=0] = {\mathbb{P}}[X_T=0] + {\mathbb{P}}[X_T = 1].\] When \(T=2N\) this evaluates to \(\binom{2N}{N} 2^{-2N}\), and when \(T=2N-1\) to \(\binom{2N-1}{N} 2^{-(2N-1)}\).
What is the probability that a simple symmetric random walk will reach the level \(l=1\) in \(T\) steps or fewer? What happens when \(T\to\infty\)?
The first question asks for the complement of the event from our previous example, so the answer is \[ 1 - {\mathbb{P}}[M_T=0] = 1- {\mathbb{P}}[X_T=0] - {\mathbb{P}}[X_T=1].\] As above, the probability \({\mathbb{P}}[M_T=0]\) of not reaching the level \(1\) evaluates to \(\binom{2N}{N} 2^{-2N}\) when \(T=2N\) is even (we skip the case of odd \(T\) because it is very similar). When \(N\to\infty\), we expect \(\binom{2N}{N}\) to go to \(+\infty\) and \(2^{-2N}\) to go to \(0\), so it is not immediately clear which term will win. One way to make a guess is to think about it probabilistically: we are looking at the probability \({\mathbb{P}}[X_{2N}=0]\) that the random walk takes the value \(0\) after exactly \(2N\) steps. Even though no other (single) value is more likely to happen, there are so many other values \(X_{2N}\) could take (anything even from \(-2N\) to \(2N\) except for \(0\)) that we conjecture that this probability converges to \(0\). A formal mathematical argument which proves that our conjecture is, indeed, correct involves Stirling’s formula:
\[ N! \sim \sqrt{2 \pi N} \left( \frac{N}{e} \right)^N \text{ where } A_N \sim B_N \text{ means that } \lim_{N\to\infty} \frac{A_N}{B_N}=1. \]
We write \(\binom{2N}{N} = \tfrac{(2N)!}{N! N!}\) and apply Stirling’s formula to each factorial (let’s skip the details) to conclude that \[ \binom{2N}{N} 2^{-2N}\sim \frac{1}{\sqrt{N \pi}} \text{ so that } \lim_{N\to\infty} \binom{2N}{N} 2^{-2N} = 0 \]
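Numerically, the approximation is excellent already for moderate \(N\); a quick sketch (we use lchoose to avoid overflowing choose for large \(N\)):

```r
# compare P[X_{2N} = 0] = choose(2N, N) 2^{-2N} with 1 / sqrt(pi N)
N = c(10, 100, 1000)
exact = exp(lchoose(2 * N, N) - 2 * N * log(2))  # computed on the log scale
approx = 1 / sqrt(pi * N)
exact / approx  # ratios approach 1 as N grows
```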
The result of the previous problem implies the following important fact:
The simple symmetric random walk will reach the level \(1\), with certainty, given enough time.
Indeed, we just proved that the probability of this not happening during the first \(T\) steps shrinks down to \(0\) as \(T\to\infty\).
But wait, there is more! By symmetry, the level \(1\) can be replaced by \(-1\). Also, once we hit \(1\), the random walk “renews itself” (this property is called the Strong Markov Property and we will talk about it later), so it will eventually hit the level \(2\), as well. Continuing the same way, we get the following remarkable result
Sooner or later, the simple symmetric random walk will visit any level.
We close this chapter with an application of the reflection principle to a classical problem in probability and combinatorics. Feel free to skip it if you want to.
Suppose that two candidates, Daisy and Oscar, are running for office, and \(T \in{\mathbb{N}}\) voters cast their ballots. Votes are counted the old-fashioned way, namely by the same official, one by one, until all \(T\) of them have been processed. After each ballot is opened, the official records the number of votes each candidate has received so far. At the end, the official announces that Daisy has won by a margin of \(k>0\) votes, i.e., that Daisy got \((T+k)/2\) votes and Oscar the remaining \((T-k)/2\) votes. What is the probability that at no time during the counting has Oscar been in the lead?
We assume that the order in which the official counts the votes is completely independent of the actual votes, and that each voter chooses Daisy with probability \(p\in (0,1)\) and Oscar with probability \(q=1-p\). We don’t know a priori what \(p\) is, and, as it turns out, we don’t need to!
For \(0 \leq n \leq T\), let \(X_n\) be the number of votes received by Daisy minus the number of votes received by Oscar in the first \(n\) ballots. When the \(n+1\)-st vote is counted, \(X_n\) either increases by \(1\) (if the vote was for Daisy), or decreases by 1 otherwise. The votes are independent of each other and \(X_0=0\), so \(X_n\), \(0\leq n \leq T\) is a simple random walk with the time horizon \(T\). The probability of an up-step is \(p\in (0,1)\), so this random walk is not necessarily symmetric. The ballot problem can now be restated as follows:
For a simple random walk \(\{X_n\}_{0\leq n \leq T}\), what is the probability that \(X_n\geq 0\) for all \(n\) with \(0\leq n \leq T\), given that \(X_T=k\)?
The first step towards understanding the solution is the realization that the exact value of \(p\) does not matter. Indeed, we are interested in the conditional probability \({\mathbb{P}}[ F|G]={\mathbb{P}}[F\cap G]/{\mathbb{P}}[G]\), where \(F\) denotes the set of \(\omega\) whose corresponding trajectories always stay non-negative, while the trajectories corresponding to \(\omega\in G\) reach \(k\) at time \(T\). Each \(\omega \in G\) consists of exactly \((T+k)/2\) up-steps (\(1\)s) and \((T-k)/2\) down steps (\(-1\)s), so its probability weight is equal to \(p^{ (T+k)/2} q^{(T-k)/2}\). Therefore, with \(\# A\) denoting the number of elements in the set \(A\), we get \[\begin{aligned} {\mathbb{P}}[ F|G]=\frac{{\mathbb{P}}[F\cap G]}{{\mathbb{P}}[G]}=\frac{\# (F\cap G) \ p^{ (T+k)/2} q^{(T-k)/2}}{ \# G \ p^{ (T+k)/2} q^{(T-k)/2}}=\frac{\#(F\cap G)}{\# G}.\end{aligned}\] This is quite amazing in and of itself. This conditional probability does not depend on \(p\) at all!
Since we already know how to count the number of elements in \(G\) (there are \(\binom{T}{(T+k)/2}\)), “all” that remains to be done is to count the number of elements in \(G\cap F\). The elements in \(G \cap F\) form a portion of all the elements in \(G\) whose trajectories don’t hit the level \(l=-1\); this way, \(\#(G\cap F)=\#G-\#H\), where \(H\) is the set of all paths which finish at \(k\), but cross (or, at least, touch) the level \(l=-1\) in the process. Can we use the reflection principle to find \(\# H\)? Yes, we can. In fact, you can convince yourself that the reflection of any trajectory corresponding to \(\omega \in H\) around the level \(l=-1\) after its last hitting time of that level produces a trajectory that starts at \(0\) and ends at \(-k-2\), and vice versa.
The number of paths from \(0\) to \(-k-2\) is easy to count - it is equal to \(\binom{T}{(T+k)/2+1}\). Putting everything together, we get \[{\mathbb{P}}[ F|G]=\frac{\binom{T}{n_1}-\binom{T}{n_1+1}} {\binom{T}{n_1}}=\frac{k+1}{n_1+1},\text{ where }n_1=\frac{T+k}{2}.\] The last equality follows from the definition of binomial coefficients \(\binom{T}{i}=\frac{T!}{i!(T-i)!}\).
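For small \(T\) the answer can be verified by enumerating all ballot orders; a sketch with \(T=6\) and \(k=2\), where the formula gives \((k+1)/(n_1+1) = 3/5\):

```r
# all 2^6 counting orders; +1 = a vote for Daisy, -1 = a vote for Oscar
T = 6; k = 2
grid = expand.grid(rep(list(c(-1, 1)), T))
paths = t(apply(grid, 1, cumsum))

ends_at_k = paths[, T] == k               # the event G: Daisy wins by k
stays_nonneg = apply(paths, 1, min) >= 0  # the event F: Oscar never leads

sum(stays_nonneg & ends_at_k) / sum(ends_at_k)  # 0.6 = (k+1)/(n_1+1)
```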
The Ballot problem has a long history (going back to at least 1887) and has spurred a lot of research in combinatorics and probability. In fact, people still write research papers on some of its generalizations. When posed outside the context of probability, it is often phrased as “in how many ways can the counting be performed …” (the difference being only in the normalizing factor \(\binom{T}{n_1}\) appearing in Example above). A special case \(k=0\) seems to be even more popular - the number of \(2n\)-step paths from \(0\) to \(0\) never going below zero is called the \(n\)-th Catalan number and equals \[\begin{align} C_n=\frac{1}{n+1} \binom{2n}{n}. \end{align}\]
Given \(n\in{\mathbb{N}}\), compute \({\mathbb{P}}[ \tau_1 = 2n+1 ]\) for a simple, but possibly biased, random walk. (Note: Clearly, \({\mathbb{P}}[ \tau_1=2n]=0\).)
Let \(A\) denote the set of all trajectories of length \(2n+1\) that hit \(1\) for the first time at time \(2n+1\), and let \(A'\) be the set of all trajectories of length \(2n\) which stay at or below \(0\) at all times and take the value \(0\) at time \(2n\). Clearly, each trajectory in \(A\) is a trajectory in \(A'\) with \(1\) attached at the very end, so that \(\# A = \# A'\).
By the (last part) of the previous problem, \(\# A' = \frac{1}{n+1} \binom{2n}{n}\) (the \(n^{\text{th}}\) Catalan number). As above, all paths in \(A\) have the same probability weight, namely \(p^{n+1} q^n\), so \[ {\mathbb{P}}[ \tau_1 = 2n+1]= p^{n+1} q^n \frac{1}{n+1} \binom{2n}{n}.\]
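A quick enumeration confirms the count for \(n=2\): out of all \(2^5\) five-step paths, exactly \(C_2=2\) visit the level \(1\) for the first time at step \(5\):

```r
# first visit to level 1 for each of the 32 five-step paths
grid = expand.grid(rep(list(c(-1, 1)), 5))
paths = t(apply(grid, 1, cumsum))
first_hit = apply(paths, 1, function(x) match(1, x))  # NA if 1 is never hit

sum(first_hit == 5, na.rm = TRUE)  # 2, the second Catalan number
```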
Given \(p\in (0,1)\),
Using the previous problem, we need to sum the following series \[\sum_{k=0}^{\infty} {\mathbb{P}}[\tau_1=k] = \sum_{n=0}^{\infty} {\mathbb{P}}[ \tau_1 = 2n+1] = \sum_{n=0}^{\infty} p^{n+1} q^{n} \frac{1}{n+1} \binom{2n}{n} = p \sum_{n=0}^{\infty} (pq)^n \frac{1}{n+1} \binom{2n}{n}.\] The sum looks difficult, so let us plot a numerical approximation of its value for different values of the parameter \(p\) (the true value is plotted in orange):
We conjecture that \({\mathbb{P}}[ \tau_1 <\infty ] = 1\) for \(p\geq \tfrac{1}{2}\), but \({\mathbb{P}}[ \tau_1<\infty]<1\) for \(p<\tfrac{1}{2}\). Indeed, using methods beyond the scope of these notes, it can be shown that our conjecture is true and that \[ {\mathbb{P}}[ \tau_1<\infty ] =\begin{cases} 1, & p \geq \tfrac{1}{2}\\ \frac{p}{q}, & p<\tfrac{1}{2}. \end{cases} \]
Since \({\mathbb{P}}[ \tau_1= \infty]>0\) for \(p<\tfrac{1}{2}\), we can immediately conclude that \({\mathbb{E}}[\tau_1]=\infty\) in that case. Therefore, we assume that \(p\geq \tfrac{1}{2}\), and consider the sum \[ {\mathbb{E}}[\tau_1] = \sum_{k=0}^{\infty} k {\mathbb{P}}[\tau_1 = k] = \sum_{n=0}^{\infty} (2n+1) {\mathbb{P}}[ \tau_1 = 2n+1] = \sum_{n=0}^{\infty} p^{n+1} q^{n} \frac{2n+1}{n+1} \binom{2n}{n}.\] We have already seen (by Stirling’s formula) that \(\binom{2n}{n} \sim \frac{2^{2n}}{\sqrt{\pi n}}\), so the question reduces to the one about the convergence of the following, simpler, series: \[ \sum_{n=1}^{\infty} \frac{1}{\sqrt{n}} p^n q^{n} 2^{2n} = \sum_{n=1}^{\infty} \frac{1}{\sqrt{n}} (4pq)^n.\] When \(p=\tfrac{1}{2}\), we have \(4pq=1\), and the series above becomes a \(p\)-series with exponent \(\tfrac{1}{2}\) (not to be confused with the parameter \(p\) of the walk). Hence, it diverges. On the other hand, when \(p>\tfrac{1}{2}\), we have \(4pq<1\), and the terms of the series are dominated by the terms of the convergent geometric series \(\sum_{n=1}^{\infty} (4pq)^n\). Therefore, it must converge, too. All in all: \[ {\mathbb{E}}[\tau_1] = \begin{cases} \infty, & p\leq \tfrac{1}{2}, \\ <\infty, & p > \tfrac{1}{2}. \end{cases}\]
Let \(a_j = {\mathbb{E}}^{j}[\tau_1]\), where \({\mathbb{E}}^{j}\) means that the random walk starts from the level \(j\), i.e., \(X_0=j\), instead of the usual \(X_0=0\). Think about why it is plausible that the following relations hold for the sequence \(a_j\): \[a_1 = 0,\text{ and } a_j = 1 + p a_{j+1} + q a_{j-1}.\] We guess that \(a_j\) has the form \(a_j = c(1-j)\), for \(j\leq 1\) (why?) and plug that guess into the above equation to get: \[ c(1-j) = 1 + p c (-j) + q c (2-j) = 1 - c + 2 c q + c(1-j).\] It follows that \(0=1-c+2cq\), i.e., \(c = \tfrac{1}{1-2q} = \tfrac{1}{p-q}\). Thus, if you believe the heuristic, we have \[ {\mathbb{E}}[ \tau_1 ] = \begin{cases} \frac{1}{p-q}, & p>\tfrac{1}{2}, \\ + \infty, & p\leq \tfrac{1}{2}. \end{cases}\] (Note: If you have never seen it before, the approach we took here seems very unusual. Indeed, in order to find the value of \(a_0\) we decided to compute the values of the elements of the whole sequence \(a_j\). This kind of thinking will appear many times later in the chapters on Markov Chains.)
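The heuristic is also easy to test by simulation. A minimal Monte Carlo sketch in Python (the cap on the walk length is an arbitrary safeguard which, for \(p=0.7\), is essentially never reached):

```python
import random

def sample_tau1(p, rng, cap=10_000):
    """Run one biased walk until it first hits +1; return the number of steps."""
    x, n = 0, 0
    while x < 1 and n < cap:
        x += 1 if rng.random() < p else -1
        n += 1
    return n

rng = random.Random(0)
p, trials = 0.7, 50_000
estimate = sum(sample_tau1(p, rng) for _ in range(trials)) / trials
# heuristic prediction: 1/(p - q) = 1/0.4 = 2.5
```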
A random time is simply a random variable which takes values in the set \({\mathbb{N}}_0\) - it is random, and it can be interpreted as a point in time. Not all random times are created equal, though: here are three examples based on a simple symmetric random walk \(X\):
\(\tau = 3\). This is the simplest random time - it always takes the value \(3\), no matter what. It is random only in the formal sense of the word (just as the constant random variable \(X=3\) is a random variable, but not a very interesting one). Constant random times, like \(\tau=3\), are called deterministic times.
\(\tau=\tau_1\) where \(\tau_1\) is the first time \(X\) hits the level \(1\). It is no longer constant - it clearly depends on the underlying trajectory of the random walk: sometimes \(\tau_1=1\); other times it can be very large.
\(\tau=\tau_{\max}\) where \(\tau_{\max}\) is the first time \(X\) takes its maximal value in the interval \(\{0,1,\dots, 100\}\). The random time \(\tau_{\max}\) is clearly non-constant, but it differs from \(\tau=3\) or \(\tau=\tau_1\) in a significant way.
Indeed, the first two examples have the following property:
Given a time \(n\), you can tell whether \(\tau=n\) or not using only the information you have gathered by time \(n\).
The third one does not. Random times with this property are called stopping times. Here is a more precise, mathematical, definition. You should note that we allow our stopping times to take the value \(+\infty\). The usual interpretation is that whatever the stopping time is modeling never happens.
Definition. A random variable \(\tau\) taking values in \({\mathbb{N}}_0\cup\{+\infty\} = \{0,1,2,\dots, +\infty\}\) is said to be a stopping time with respect to the process \(\{X_n\}_{n\in {\mathbb{N}}_0}\) if for each \(n\in{\mathbb{N}}_0\) there exists a function \(G^n:{\mathbb{R}}^{n+1}\to \{0,1\}\) such that \[\mathbf{1}_{\{\tau=n\}}=G^n(X_0,X_1,\dots, X_n), \text{ for all } n\in{\mathbb{N}}_0.\]
The functions \(G^n\) are called the decision functions, and should be thought of as a black box which takes the values of the process \(\{X_n\}_{n\in {\mathbb{N}}_0}\) observed up to the present point and outputs either \(0\) or \(1\). The value \(0\) means keep going and \(1\) means stop. The whole point is that the decision has to be based only on the available observations and not on the future ones.
Alternatively, you can think of a stopping time as an R function whose input is a vector which represents a trajectory \(\omega\) of a random walk (or any other process) and whose output is a nonnegative integer. This function needs to be such that if it “decides” to output the value \(k\), it had to have based its decision only on the first \(k\) components of \(\omega\). This means that if the output corresponding to the input trajectory \(\omega\) is \(k\), and \(\omega'\) is another trajectory whose first \(k\) components match those of \(\omega\), then the output corresponding to \(\omega'\) must also be \(k\).
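For instance, the first hitting time of a level can be written as such a function; here is a Python sketch of the idea (the function name is ad hoc):

```python
def first_hitting_time(omega, level):
    """Scan the trajectory left to right; output the first k with omega[k] == level.
    The decision to stop at k looks only at omega[0], ..., omega[k]."""
    for k, x in enumerate(omega):
        if x == level:
            return k
    return float("inf")  # the level is never reached on this trajectory
```

Any two trajectories that agree up to the step at which the function stops produce the same output, which is exactly the stopping-time property.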
Now that we know how to spot stopping times, let’s list some examples:
The simplest examples of stopping times are (non-random) deterministic times. Just set \(\tau=n_0\) for some \(n_0\in{\mathbb{N}}_0\cup\{+\infty\}\) (say, \(\tau=5\) or \(\tau=723\)), no matter what the state of the world \(\omega\in\Omega\) is. The family of decision rules is easy to construct: \[G^n(x_0,x_1,\dots, x_n)=\begin{cases} 1,& n=n_0, \\ 0, & n\not= n_0.\end{cases}\] The decision functions \(G^n\) do not depend on the values of \(X_0,X_1,\dots, X_n\) at all. A gambler who stops gambling after \(20\) games, no matter what the winnings or losses are, uses such a rule.
Probably the most well-known examples of stopping times are (first) hitting times. They can be defined for general stochastic processes, but we will stick to simple random walks for the purposes of this example. So, let \(X_n=\sum_{k=1}^n \delta_k\) be a simple random walk, and let \(\tau_l\) be the first time \(X\) hits the level \(l\in{\mathbb{N}}\). More precisely, we use the following slightly non-intuitive but mathematically correct definition \[\tau_l=\min \{ n\in{\mathbb{N}}_0\, : \, X_n=l\}.\] The set \( \{ n\in{\mathbb{N}}_0\, : \, X_n=l\}\) is the collection of all time-points at which \(X\) visits the level \(l\). The earliest one - the minimum of that set - is the first hitting time of \(l\). In states of the world \(\omega\in\Omega\) in which the level \(l\) just never gets reached, i.e., when \( \{ n\in{\mathbb{N}}_0\, : \, X_n=l\}\) is an empty set, we set \(\tau_l(\omega)=+\infty\).
In order to show that \(\tau_l\) is indeed a stopping time, we need to construct the decision functions \(G^n\), \(n\in{\mathbb{N}}_0\). Let us start with \(n=0\). We would have \(\tau_l=0\) only in the (impossible) case \(X_0=l\), so we always have \(G^0(X_0)=0\). How about \(n\in{\mathbb{N}}\)? For the value of \(\tau_l\) to be equal to exactly \(n\), two things must happen:
\(X_n=l\) (the level \(l\) must actually be hit at time \(n\)), and
\(X_{n-1}\not = l\), \(X_{n-2}\not= l\), …, \(X_{1}\not=l\), \(X_0\not=l\) (the level \(l\) has not been hit before).
Therefore, \[G^n(x_0,x_1,\dots, x_n)=\begin{cases} 1,& x_0\not=l, x_1\not= l, \dots, x_{n-1}\not=l, x_n=l\\ 0,&\text{otherwise}. \end{cases}\] The hitting time \(\tau_2\) of the level \(l=2\) for a particular trajectory of a symmetric simple random walk is depicted below:
How about something that is not a stopping time? Let \(T\in{\mathbb{N}}\) be an arbitrary time-horizon and let \(\tau_{\max}\) be the last time during \(0,\dots, T\) that the random walk visits its maximum during \(0,\dots, T\): \[\tau_{\max}=\max\Big\{ n\in\{0,\dots,T\}\, :\, X_n=\max_{0\leq k\leq T} X_k \Big\}.\]
If you bought a share of a stock at time \(n=0\), had to sell it some time before or at \(T\) and had the ability to predict the future, this is one of the points you would choose to sell it at. Of course, it is impossible in general to decide whether \(\tau_{\max}=n\), for some \(n\in0,\dots, T-1\) without the knowledge of the values of the random walk after \(n\).
More precisely, let us sketch the proof of the fact that \(\tau_{\max}\) is not a stopping time. Suppose, to the contrary, that it is, and let \(G^n\) be the associated family of decision functions. Consider the following two trajectories: \((0,1,2,3,\dots, T-1,T)\) and \((0,1,2,3,\dots, T-1,T-2)\). They differ only in the direction of the last step. They also differ in the fact that \(\tau_{\max}=T\) for the first one and \(\tau_{\max}=T-1\) for the second one. On the other hand, by the definition of the decision functions, we have \[\mathbf{1}_{\{\tau_{\max}=T-1\}}=G^{T-1}(X_0,\dots, X_{T-1}).\] The right-hand side is equal for both trajectories, while the left-hand side equals \(0\) for the first one and \(1\) for the second one. A contradiction.
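The same argument is easy to machine-check for a small horizon, say \(T=6\): the two trajectories share everything up to time \(T-1\), yet their values of \(\tau_{\max}\) differ. A Python sketch (ad hoc names):

```python
def tau_max(omega):
    """Last time the trajectory visits its overall maximum."""
    m = max(omega)
    return max(k for k, x in enumerate(omega) if x == m)

T = 6
up = list(range(T + 1))           # (0, 1, ..., 6): last step up, tau_max = T
down = list(range(T)) + [T - 2]   # (0, 1, ..., 5, 4): last step down, tau_max = T - 1
# up[:T] == down[:T], so no decision function based on the first T values can
# distinguish the two.
```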
One of the superpowers of stopping times is that they often behave just like deterministic times. The best way to understand this statement is in the context of the beautiful martingale theory. Unfortunately, learning about martingales would take an entire semester, so we have to settle for an illustrative example, namely, Wald’s identity.
Let \(\{\xi_n\}_{n\in{\mathbb{N}}}\) be a sequence of independent and identically distributed random variables. The example you should keep in mind is \(\xi_n = \delta_n\), where \(\delta_n\) are coin tosses in the definition of a random walk. We set \(X_n = \sum_{k=1}^n \xi_k\) and note that it is easy to compute \({\mathbb{E}}[X_n]\): \[ {\mathbb{E}}[ X_n ] = {\mathbb{E}}[ \xi_1+\dots + \xi_n] = {\mathbb{E}}[\xi_1] + \dots + {\mathbb{E}}[\xi_n] = n \mu, \text{ where } \mu = {\mathbb{E}}[\xi_1]={\mathbb{E}}[\xi_2]=\dots\] provided \({\mathbb{E}}[\xi_1]\) exists. The expected value \(\mu\) is the same for all \(\xi_1,\xi_2,\dots\) because they all have the same distribution. In words, the equality above tells us that the expected value of \(X\) moves with speed \(\mu\). Wald’s identity tells us that the same thing is true when the deterministic time \(n\) is replaced by a stopping time. To understand its statement below, we must first introduce a bit more notation. Let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be a stochastic process, and let \(\tau\) be a random time which never takes the value \(+\infty\). Remember that \(X_0, X_1, \dots\) are random variables, i.e., functions of the elementary outcome \(\omega\in\Omega\). The same is true for \(\tau\). Therefore, in order to define the random variable \(X_{\tau}\) we need to specify what its value is for any given \(\omega\): \[ X_{\tau} (\omega) = X_{n}(\omega) \text{ where } n=\tau(\omega).\] This is exactly what you would expect; the elementary outcome \(\omega\) not only tells us which trajectory of the process to consider, but also the time at which to do it. Note that when \(\tau=n\) is a deterministic time, \(X_{\tau}\) is exactly \(X_n\).
Theorem. (Wald’s identity) Let \(\{\xi_n\}_{n\in{\mathbb{N}}}\) be a sequence of independent and identically distributed random variables, and let \(X_n = \sum_{k=1}^n \xi_k\) be the associated random walk. If \({\mathbb{E}}[ |\xi_n|]<\infty\) and \(\tau\) is a stopping time for \(\{X_n\}_{n\in {\mathbb{N}}_0}\) such that \({\mathbb{E}}[\tau]<\infty\), then \[ {\mathbb{E}}[X_{\tau}] = {\mathbb{E}}[\tau] \mu \text{ where } \mu = {\mathbb{E}}[\xi_1] = {\mathbb{E}}[\xi_2] = \dots \]
Before we prove this theorem, here is a handy identity:
(The “tail formula” for the expectation) Let \(\tau\) be an \({\mathbb{N}}_0\)-valued random variable. Show that \[{\mathbb{E}}[\tau]=\sum_{k=1}^{\infty} {\mathbb{P}}[\tau \geq k].\]
Clearly, \({\mathbb{P}}[\tau\geq k] = {\mathbb{P}}[ \tau=k] + {\mathbb{P}}[\tau=k+1]+\dots\). Therefore,
\[ \begin{array}{cccccccc} \sum_{k=1}^{\infty} {\mathbb{P}}[\tau \geq k] &=& {\mathbb{P}}[ \tau=1] &+& {\mathbb{P}}[\tau=2] &+& {\mathbb{P}}[\tau=3] &+& \dots \\ && &+& {\mathbb{P}}[\tau=2] &+& {\mathbb{P}}[\tau=3] &+& \dots \\ && && &+& {\mathbb{P}}[\tau=3] &+& \dots \\ && && && &+& \dots \end{array} \] If you look at the “columns”, you will realize that the expression \({\mathbb{P}}[\tau=1]\) appears in this sum once, \({\mathbb{P}}[\tau=2]\) twice, \({\mathbb{P}}[\tau=3]\) three times, etc. Hence \[\sum_{k=1}^{\infty} {\mathbb{P}}[ \tau\geq k] = \sum_{n=1}^{\infty} n {\mathbb{P}}[\tau=n] = {\mathbb{E}}[\tau].\]
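As a quick numerical sanity check of the tail formula, take \(\tau\) geometric on \(\{1,2,\dots\}\) with \({\mathbb{P}}[\tau=n]=q^{n-1}p\), for which \({\mathbb{E}}[\tau]=1/p\). A Python sketch (the truncation point \(N\) is an arbitrary choice, large enough that the neglected tail is negligible):

```python
p, q = 0.3, 0.7
N = 2000  # truncation point; q**N is astronomically small

pmf = [0.0] + [q ** (n - 1) * p for n in range(1, N)]   # pmf[n] = P[tau = n]
mean = sum(n * pmf[n] for n in range(N))                # E[tau] directly
tail_sum = sum(sum(pmf[k:]) for k in range(1, N))       # sum_k P[tau >= k]
# mean and tail_sum agree, and both are close to 1/p
```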
Prove Wald’s identity.
Here is another representation of the random variable \(X_{\tau}\): \[X_{\tau} = \sum_{k=1}^{\tau} \xi_k=\sum_{k=1}^{\infty} \xi_k \mathbf{1}_{\{k\leq \tau\}}.\] The idea behind it is simple: add all the values of \(\xi_k\) for \(k\leq \tau\) and keep adding zeros (since \(\xi_k \mathbf{1}_{\{k\leq \tau\}}=0\) for \(k>\tau\)) after that. Taking the expectation of both sides and switching \({\mathbb{E}}\) and \(\sum\) (this can be justified, but the argument is technical and we omit it here) yields: \[ {\mathbb{E}}[\sum_{k=1}^{\tau} \xi_k]=\sum_{k=1}^{\infty} {\mathbb{E}}[ \mathbf{1}_{\{k\leq \tau\}}\xi_k]. \] Let us examine the term \({\mathbb{E}}[\xi_k\mathbf{1}_{\{k\leq \tau\}}]\) in some detail. We first note that \[\mathbf{1}_{\{k\leq \tau\}}=1-\mathbf{1}_{\{k>\tau\}}=1-\mathbf{1}_{\{k-1\geq \tau\}}=1-\sum_{j=0}^{k-1}\mathbf{1}_{\{\tau=j\}},\] so that \[ {\mathbb{E}}[\xi_k \mathbf{1}_{\{k\leq \tau\}}]={\mathbb{E}}[\xi_k]-\sum_{j=0}^{k-1}{\mathbb{E}}[ \xi_k \mathbf{1}_{\{\tau=j\}} ].\] By the assumption that \(\tau\) is a stopping time, the indicator \(\mathbf{1}_{\{\tau=j\}}\) can be represented as \(\mathbf{1}_{\{\tau=j\}}=G^j(X_0,\dots, X_j)\), and, because each \(X_i\) is just a sum of the increments \(\xi_1, \dots, \xi_i\), we can actually write \(\mathbf{1}_{\{\tau=j\}}\) as a function of \(\xi_1,\dots, \xi_j\) only: \(\mathbf{1}_{\{\tau=j\}}=H^j(\xi_1,\dots, \xi_j).\) By the independence of \((\xi_1,\dots, \xi_j)\) from \(\xi_k\) (because \(j<k\)) we have \[\begin{align} {\mathbb{E}}[\xi_k \mathbf{1}_{\{\tau=j\}}]&={\mathbb{E}}[ \xi_k H^j(\xi_1,\dots, \xi_j)]= {\mathbb{E}}[\xi_k] {\mathbb{E}}[ H^j(\xi_1,\dots, \xi_j)]={\mathbb{E}}[\xi_k] {\mathbb{E}}[\mathbf{1}_{\{\tau=j\}}]= {\mathbb{E}}[\xi_k]{\mathbb{P}}[\tau=j]. 
\end{align}\] Therefore, \[\begin{align} {\mathbb{E}}[\xi_k \mathbf{1}_{\{k\leq \tau\}}]&={\mathbb{E}}[\xi_k]-\sum_{j=0}^{k-1} {\mathbb{E}}[\xi_k] {\mathbb{P}}[\tau=j]={\mathbb{E}}[\xi_k] {\mathbb{P}}[\tau\geq k] =\mu {\mathbb{P}}[\tau\geq k], \end{align}\] where the last equality follows from the fact that all \(\xi_k\) have the same expectation, namely \(\mu\).
Putting it all together, we get \[\begin{align} {\mathbb{E}}[X_{\tau}]&={\mathbb{E}}[\sum_{k=1}^{\tau} \xi_k]=\sum_{k=1}^{\infty} \mu {\mathbb{P}}[\tau\geq k]=\mu \sum_{k=1}^{\infty} {\mathbb{P}}[\tau\geq k]= {\mathbb{E}}[\tau] \mu, \end{align}\] where we use the “tail formula” to get the last equality.
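To see Wald’s identity in action, here is a small Monte Carlo sketch in Python (the parameters and variable names are ad hoc): a biased walk with \(p=0.6\), so \(\mu=0.2\), stopped at the first exit from \((-5,5)\); both sides of the identity are estimated from the same sample:

```python
import random

rng = random.Random(42)
p, mu = 0.6, 0.2
trials = 30_000
sum_stop_value, sum_stop_time = 0, 0
for _ in range(trials):
    x, n = 0, 0
    while -5 < x < 5:                # tau = first exit time from (-5, 5)
        x += 1 if rng.random() < p else -1
        n += 1
    sum_stop_value += x
    sum_stop_time += n

mean_X_tau = sum_stop_value / trials
mean_tau = sum_stop_time / trials
# Wald's identity: mean_X_tau should be close to mu * mean_tau
```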
Show, by giving an example, that Wald’s identity does not necessarily hold if \(\tau\) is not a stopping time.
Let \(X\) be a simple symmetric random walk, and let \(\tau\) be a random time constructed like this: \[\begin{align} \tau = \begin{cases} 1, & X_1=1 \\ 0,& X_1=-1. \end{cases} \end{align}\] Then, \[\begin{align} X_{\tau} = \begin{cases} X_1, & X_1=1 \\ X_0, & X_1=-1, \end{cases} = \begin{cases} 1, & X_1=1 \\ 0,& X_1=-1. \end{cases} \end{align}\] and, therefore, \({\mathbb{E}}[ X_{\tau}] = 1 \cdot 1/2 + 0 \cdot 1/2 = 1/2\). On the other hand \(\mu={\mathbb{E}}[\xi_1]=0\) and \({\mathbb{E}}[\tau] = 1/2\), so \(1/2 = {\mathbb{E}}[X_{\tau}] \ne {\mathbb{E}}[\tau] \mu = 0\).
It is clear that \(\tau\) cannot be a stopping time, since Wald’s identity would hold for it if it were. To see this more directly, consider the event \(\{\tau=0\}\). Its occurrence depends on whether \(X_1=1\) or not, which is not known at time \(0\).
A famous use of Wald’s identity is in the solution of the following classical problem:
A gambler starts with \(\$x\) dollars and repeatedly plays a game in which she wins a dollar with probability \(\tfrac{1}{2}\) and loses a dollar with probability \(\tfrac{1}{2}\). She decides to stop when one of the following two things happens:
she goes bankrupt, i.e., her wealth hits \(0\), or
she makes enough money, i.e., her wealth reaches some predetermined level \(a>x\).
The “Gambler’s ruin” problem (dating back at least to the 1600s) asks the following question: what is the probability that the gambler will make \(a\) dollars before she goes bankrupt?
Let the gambler’s “wealth” \(\{W_n\}_{n\in {\mathbb{N}}_0}\) be modeled by a simple random walk starting from \(x\), whose increments \(\xi_k=\delta_k\) are coin-tosses. Then \(W_n=x+X_n\), where \(X_n = \sum_{k=1}^n \xi_k\) is a SSRW. Let \(\tau\) be the time the gambler stops. We can represent \(\tau\) in two different (but equivalent) ways. On the one hand, we can think of \(\tau\) as the smaller of the two hitting times \(\tau_{-x}\) and \(\tau_{a-x}\) of the levels \(-x\) and \(a-x\) for the random walk \(\{X_n\}_{n\in {\mathbb{N}}_0}\) (remember that \(W_n=x+X_n\), so these two correspond to the hitting times for the process \(\{W_n\}_{n\in {\mathbb{N}}_0}\) of the levels \(0\) and \(a\)). On the other hand, we can think of \(\tau\) as the first hitting time of the two-element set \(\{-x,a-x\}\) for the process \(\{X_n\}_{n\in {\mathbb{N}}_0}\). In either case, it is quite clear that \(\tau\) is a stopping time (can you write down the decision functions?).
When we talked about the maximum of the simple symmetric random walk, we proved that it hits any value if given enough time. Therefore, the probability that the gambler’s wealth will remain strictly between \(0\) and \(a\) forever is zero. So, \({\mathbb{P}}[\tau<\infty]=1\).
What can we say about the random variable \(X_{\tau}\) - the gambler’s wealth (minus \(x\)) at the random time \(\tau\)? Clearly, it is either equal to \(-x\) or to \(a-x\), and the probabilities \(p_0\) and \(p_a\) with which it takes these values are exactly what we are after in this problem. We know that, since there are no other values \(X_{\tau}\) can take, we must have \(p_0+p_a=1\). Wald’s identity gives us another equation for \(p_0\) and \(p_a\): \[{\mathbb{E}}[X_{\tau}]={\mathbb{E}}[\xi_1] {\mathbb{E}}[\tau]=0\cdot {\mathbb{E}}[\tau]=0 \text{ so that } 0 = {\mathbb{E}}[X_{\tau}]=p_0 (-x)+p_a (a-x).\]
We now have a system of two linear equations with two unknowns, and solving it yields \[p_0= \frac{a-x}{a}, \ p_a=\frac{x}{a}.\] It is remarkable that the two probabilities are proportional to the amounts of money the gambler needs to make (lose) in the two outcomes. The situation is different when \(p\not=\tfrac{1}{2}\).
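A simulation sketch (in Python; the function name is ad hoc) that estimates \(p_a\) for \(x=3\), \(a=10\) and compares it with the theoretical value \(x/a=0.3\):

```python
import random

def prob_reach_a(x, a, trials, rng):
    """Fraction of fair-game runs in which wealth hits a before 0."""
    wins = 0
    for _ in range(trials):
        w = x
        while 0 < w < a:
            w += 1 if rng.random() < 0.5 else -1
        wins += (w == a)
    return wins / trials

estimate = prob_reach_a(3, 10, 20_000, random.Random(1))  # theory: 3/10
```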
In order to be able to use Wald’s identity, we need to check its conditions. We have already seen that \(\tau\) needs to be a stopping time, and not just any old random time. There are also two conditions about the expected values of \(\tau\) and of \(\xi_1\). If you read the above solution carefully, you will realize that we never checked whether \({\mathbb{E}}[\tau]<\infty\). We should have, but we did not because we still don’t have the mathematical tools to do it. We will see later that, indeed, \({\mathbb{E}}[\tau]<\infty\) for this particular stopping time. In general, the condition that \({\mathbb{E}}[\tau]<\infty\) is important, as the following simple example shows:
Let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be a simple symmetric random walk, and let \(\tau_1\) be the first hitting time of the level \(1\). Use Wald’s identity to show that \({\mathbb{E}}[\tau_1]=+\infty\).
Suppose, to the contrary, that \({\mathbb{E}}[\tau_1]<\infty\). Since \({\mathbb{E}}[|\delta_1|]<\infty\) and \(\tau_1\) is a stopping time, Wald’s identity applies: \[ {\mathbb{E}}[X_{\tau_1}] = {\mathbb{E}}[ \delta_1] \cdot {\mathbb{E}}[\tau_1].\] The right-hand side is then equal to \(0\) because \({\mathbb{E}}[\delta_1]=0\). On the other hand, \(X_{\tau_1}=1\): the value of \(X_n\) when it first hits the level \(1\) is, of course, \(1\). This leads to the contradiction \(1={\mathbb{E}}[X_{\tau_1}] = {\mathbb{E}}[\delta_1] {\mathbb{E}}[\tau_1] = 0\). Therefore, our initial assumption that \({\mathbb{E}}[\tau_1]<\infty\) was wrong!
We close this chapter with another identity of Abraham Wald, namely “Wald’s second identity”. The original identity helped us compute the expected value of the position of a random walk at a stopping time. The second one computes the variance:
Theorem. (Wald’s second identity) Let \(\{\xi_n\}_{n\in{\mathbb{N}}}\) be a sequence of independent and identically distributed random variables, and let \(X_n = \sum_{k=1}^n \xi_k\) be the associated random walk. If \({\mathbb{E}}[ (\xi_n)^2]<\infty\) and \(\tau\) is a stopping time for \(\{X_n\}_{n\in {\mathbb{N}}_0}\) such that \({\mathbb{E}}[\tau]<\infty\), then \[ \operatorname{Var}[X_{\tau}]= {\mathbb{E}}[\tau] \sigma^2 \text{ where } \sigma^2 = \operatorname{Var}[\xi_1] = \operatorname{Var}[\xi_2] = \dots \]
The proof is similar to (but more difficult than) the proof of Wald’s (first) identity, so we skip it.
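As an illustration, apply the second identity to the Gambler’s ruin example above (with \(p=\tfrac{1}{2}\), so \(\sigma^2=1\)): since \({\mathbb{E}}[X_\tau]=0\), we get \({\mathbb{E}}[\tau]=\operatorname{Var}[X_\tau]={\mathbb{E}}[X_\tau^2]=x^2\tfrac{a-x}{a}+(a-x)^2\tfrac{x}{a}=x(a-x)\). A quick Python sketch (ad hoc names) checking \({\mathbb{E}}[\tau]=3\times 7=21\) for \(x=3\), \(a=10\) by simulation:

```python
import random

def mean_exit_time(x, a, trials, rng):
    """Average number of fair-game rounds until wealth hits 0 or a."""
    total = 0
    for _ in range(trials):
        w, n = x, 0
        while 0 < w < a:
            w += 1 if rng.random() < 0.5 else -1
            n += 1
        total += n
    return total / trials

estimate = mean_exit_time(3, 10, 20_000, random.Random(7))  # theory: 3 * 7 = 21
```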
Let \(\{X_n\}_{n\in {\mathbb N}_0}\) be a simple symmetric random walk.
Compute \({\mathbb{P}}[ X_1 + X_2 + X_3 > 0]\).
Compute \({\mathbb{P}}[ X_3 = 1, X_9 = 1 \text{ and } X_{15} = 3]\).
Compute \({\mathbb{P}}[ X_n\geq -2 \text{ for all $n\leq 10$}]\).
Find the distribution (table) of the product \(X_1 X_3\).
There are \(8\) possible trajectories \((0,x_1,x_2,x_3)\) of length \(3\) that a random walk can take. Out of the \(8\), only the following three have \(x_1+x_2+x_3>0\): \[\begin{align} (0,1,2,3), (0,1,2,1) \text{ and } (0,1,0,1) \end{align}\] Since the walk is symmetric, each of those has probability \(1/8\), so \({\mathbb{P}}[ X_1+X_2+X_3>0] = 3/8\).
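This count is easy to automate; a Python sketch that enumerates all \(8\) step patterns:

```python
from itertools import product

favorable = 0
for s1, s2, s3 in product([-1, 1], repeat=3):
    x1, x2, x3 = s1, s1 + s2, s1 + s2 + s3   # partial sums X_1, X_2, X_3
    if x1 + x2 + x3 > 0:
        favorable += 1
prob = favorable / 8   # should be 3/8
```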
Since the increments of the random walk are independent, we have \[\begin{align} {\mathbb{P}}[ X_3 = 1, X_9=1 \text{ and } X_{15} = 3] & = {\mathbb{P}}[ X_3 = 1, \, X_9-X_3 = 0,\, X_{15} - X_9 = 2]\\ &= {\mathbb{P}}[ X_3 = 1]\times {\mathbb{P}}[ X_9 - X_3 =0 ] \times {\mathbb{P}}[ X_{15} - X_9 = 2] \end{align}\] Moreover, we have \({\mathbb{P}}[X_9 - X_3=0] = {\mathbb{P}}[ X_6 - X_0=0] = {\mathbb{P}}[ X_6 = 0]\). Similarly \({\mathbb{P}}[ X_{15} - X_9 = 2] = {\mathbb{P}}[ X_6 = 2]\). It remains to use the formula for probabilities of the form \({\mathbb{P}}[ X_n = k]\) from the notes to obtain
\[\begin{align} {\mathbb{P}}[ X_3 = 1, X_9=1 \text{ and } X_{15} = 3] & = \binom{3}{2} 2^{-3} \times \binom{6}{3} 2^{-6} \times \binom{6}{4} 2^{-6}= 2^{-15} \binom{6}{4} \binom{6}{3} \end{align}\]
By symmetry, this is the same as \({\mathbb{P}}[ X_n \leq 2, 0\leq n \leq 10] = {\mathbb{P}}[ M_{10} \leq 2]\), where \(M\) is the running-maximum process. By (the formula derived from) the reflection principle, we have \[\begin{align} {\mathbb{P}}[ M_{10} \leq 2] &= {\mathbb{P}}[ M_{10} = 2]+{\mathbb{P}}[M_{10} = 1] + {\mathbb{P}}[ M_{10} = 0] =({\mathbb{P}}[ X_{10} = 2] + {\mathbb{P}}[X_{10}=3]) +({\mathbb{P}}[X_{10} = 1] + {\mathbb{P}}[X_{10} = 2]) + ({\mathbb{P}}[ X_{10}=0] + {\mathbb{P}}[ X_{10}=1]) \\ &= {\mathbb{P}}[ X_{10} = 0] + 2 {\mathbb{P}}[ X_{10} = 2] = \binom{10}{5} 2^{-10} + 2 \times \binom{10}{6} 2^{-10}, \end{align}\] where the odd-valued terms vanish because \(X_{10}\) must be even.
There are several ways of solving this problem. The simplest one would be to list all \(8\) trajectories of the random walk of length \(3\) and compute the value of \(X_1 \times X_3\) on each of them:
| trajectory | value |
|---|---|
| (0,1,2,3) | 3 |
| (0,1,2,1) | 1 |
| (0,1,0,1) | 1 |
| (0,1,0,-1) | -1 |
| (0,-1,0,1) | -1 |
| (0,-1,0,-1) | 1 |
| (0,-1,-2,-1) | 1 |
| (0,-1,-2,-3) | 3 |
Since each trajectory has probability \(1/8\), counting the number of times each of the possible values \(-1\), \(1\) or \(3\) appears gives us the distribution of \(X_1 X_3\):
| -1 | 1 | 3 |
|---|---|---|
| 0.25 | 0.5 | 0.25 |
Alternatively, we can write \(X_1\) and \(X_3\) as a sum of independent coin tosses and obtain \[\begin{align} X_1 X_3 = X_1 (X_1+ X_3 - X_1) = X_1^2 + X_1(X_3 - X_1) = 1 + X_1 (X_3 - X_1). \end{align}\] The random variable \(X_3 - X_1\) has the same distribution as \(X_2\). It is, also, independent of \(X_1\), so multiplying it by \(X_1\) is equivalent to switching its sign with probability \(1/2\), independently of its value. But \(X_2\) is symmetric so this independent sign switch does not affect its distribution. Hence \(X_1(X_3-X_1)\) has the same distribution as \(X_2\), and since \(X_2\) takes values \(-2,0\) and \(2\) with probabilities \(0.25, 0.5\) and \(0.25\), we arrive quickly at the distribution table above.
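Either argument can be confirmed by enumerating the \(8\) trajectories; a Python sketch:

```python
from itertools import product
from collections import Counter

counts = Counter()
for s1, s2, s3 in product([-1, 1], repeat=3):
    x1 = s1                  # X_1
    x3 = s1 + s2 + s3        # X_3
    counts[x1 * x3] += 1

distribution = {value: c / 8 for value, c in sorted(counts.items())}
# distribution: {-1: 0.25, 1: 0.5, 3: 0.25}
```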
Let \(\{X_n\}_{0\leq n \leq 10}\) be a simple symmetric random walk with time horizon \(T=10\). What is the probability it will never reach the level \(5\)?
A fair coin is tossed repeatedly, with the first toss resulting in \(H\) (i.e., heads). After that, each time the outcome of the coin matches the previous outcome, the player gets a dollar. If the two do not match, the player has to pay a dollar. The player stops playing once she “earns” \(10\) dollars. What is the probability that she will need at least 20 tosses (including the first one) to achieve that?
A fair coin is tossed repeatedly and the record of the outcomes is kept. Tossing stops the moment the total number of heads obtained so far exceeds the total number of tails by 3. For example, a possible sequence of tosses could look like HHTTTHHTHHTHH. What is the probability that the length of such a sequence is at most 10?
\[\begin{aligned} {\mathbb{P}}[ M_{10}\leq 4 ] &= {\mathbb{P}}[ M_{10}=0] + {\mathbb{P}}[ M_{10}=1] + {\mathbb{P}}[ M_{10} = 2] + {\mathbb{P}}[M_{10} = 3] + {\mathbb{P}}[ M_{10} = 4] \\ & = ({\mathbb{P}}[ X_{10} = 0] + {\mathbb{P}}[ X_{10} = 1] ) + ({\mathbb{P}}[ X_{10} = 1] + {\mathbb{P}}[ X_{10} = 2] ) \\ & + ({\mathbb{P}}[ X_{10} = 2] + {\mathbb{P}}[ X_{10} = 3] ) + ({\mathbb{P}}[ X_{10} = 3] + {\mathbb{P}}[ X_{10} = 4] )\\ & + ({\mathbb{P}}[ X_{10} = 4] + {\mathbb{P}}[ X_{10} = 5] ) \\ &= 2 ({\mathbb{P}}[ X_{10}=4] + {\mathbb{P}}[X_{10} = 2]) + {\mathbb{P}}[X_{10} =0] \\&= 2^{-10}( 2 \binom{10}{7} + 2 \binom{10}{6} + \binom{10}{5}) \end{aligned}\]
Let the outcomes of the coin tosses be denoted by \(\gamma_1 = H\), \(\gamma_2, \gamma_3, \dots\). We define the random variables \(\delta_1,\delta_2,\dots\) as follows: \(\delta_1 = 1\) if \(\gamma_2 = \gamma_1\) and \(\delta_1 = -1\), otherwise. Similarly, \(\delta_2 = 1\) if \(\gamma_3 = \gamma_2\) and \(-1\) otherwise. It is clear that \(\delta_1,\delta_2,\dots\) is an iid sequence of coin tosses, just like in the definition of a simple symmetric random walk. After \(n\) tosses (not counting the first one), our gambler has \(X_n = \delta_1+\delta_2 + \dots + \delta_n\) dollars. She will need at least \(19\) tosses (excluding the first one) to reach \(10\) dollars if and only if the value of the running maximum process at time \(n=18\) is at most \(9\). Using the formula from the formula sheet, this evaluates to \[\begin{aligned} {\mathbb{P}}[ M_{18}\leq 9] &= \sum_{k=0}^{9} {\mathbb{P}}[ M_{18} = k] = \sum_{k=0}^{9} ({\mathbb{P}}[X_{18}=k] + {\mathbb{P}}[ X_{18} = k+1])\\ & = {\mathbb{P}}[ X_{18} =0 ] + 2\, {\mathbb{P}}[X_{18} = 2] + 2\, {\mathbb{P}}[X_{18} = 4] + \dots \\ & \qquad \dots + 2\, {\mathbb{P}}[ X_{18} = 8] + {\mathbb{P}}[ X_{18} = 10] \\ &= 2^{-18}\left( \binom{18}{9} + 2 \binom{18}{10} + 2\binom{18}{11} + 2 \binom{18}{12} + 2 \binom{18}{13} + \binom{18}{14}\right) \end{aligned}\] By the way, you could have gotten a seemingly different answer. Since it is impossible to reach \(10\) in exactly \(19\) steps (the parity is wrong), the required probability is also equal to \[\begin{align} {\mathbb{P}}[ M_{19}\leq 9] &= \sum_{k=0}^9 \Big( {\mathbb{P}}[ X_{19} = k] + {\mathbb{P}}[ X_{19} = k+1] \Big) = 2^{-19} \times 2 \times \sum_{k\in\{1,3,5,7,9\}} \binom{19}{(19+k)/2}\\ &= 2^{-18} \times \left( \binom{19}{10} + \binom{19}{11} + \binom{19}{12} + \binom{19}{13} + \binom{19}{14} \right), \end{align}\] which agrees with the first expression by Pascal’s rule \(\binom{19}{i} = \binom{18}{i-1}+\binom{18}{i}\).
Let \(X_n\), \(n\in{\mathbb{N}}_0\) be the number of heads minus the number of tails obtained so far. Then, \(\{X_n\}_{n\in {\mathbb{N}}_0}\) is a simple symmetric random walk, and we stop tossing the coin when \(X\) hits \(3\) for the first time. This will happen during the first 10 tosses, if and only if \(M_{10} \geq 3\), where \(M_n\) denotes the (running) maximum of \(X\). According to the reflection principle, \[\nonumber \begin{split} {\mathbb{P}}[M_{10} \geq 3]&= {\mathbb{P}}[ X_{10} \geq 3 ] + {\mathbb{P}}[ X_{10} \geq 4]\\ & = 2( {\mathbb{P}}[X_{10}= 4] +{\mathbb{P}}[X_{10}= 6] +{\mathbb{P}}[X_{10}= 8] +{\mathbb{P}}[X_{10}= 10])\\ &= 2^{-9} \left[ \binom{10}{3}+\binom{10}{2}+\binom{10}{1}+\binom{10}{0} \right] = {\frac{11}{32}}. \end{split}\]
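The answer \(\tfrac{11}{32}\) can be double-checked by brute force over all \(2^{10}\) step sequences; a Python sketch:

```python
from itertools import accumulate, product

hits = 0
for steps in product([-1, 1], repeat=10):
    if max(accumulate(steps)) >= 3:   # running max of X_1, ..., X_10 reaches 3
        hits += 1
# hits / 2**10 should equal 11/32
```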
Luke starts a random walk, where each step takes him to the left or to the right, with the two alternatives being equally likely and independent of the previous steps. \(11\) steps to his right is a cookie jar, and Luke gets to take a (single) cookie every time he reaches that position. He performs exactly \(15\) steps, and then stops.
What is the probability that Luke will be exactly by the cookie jar when he stops?
What is the probability that Luke stops with exactly \(3\) cookies in his hand?
What is the probability that Luke stops with at least one cookie in his hand?
Suppose now that we place a bowl of broccoli soup one step to the right of the cookie jar. It smells so bad that, if reached, Luke will throw away all the cookies he is currently carrying (if any) and run away pinching his nose. What is the probability that Luke will finish his \(15\)-step walk without ever encountering the yucky bowl of broccoli soup and with at least one cookie in his hand?
Let \(C_n = \frac{1}{n+1}\binom{2n}{n}\) denote the \(n\)-th Catalan number, as defined at the end of the discussion of the Ballot problem above.
Use the reflection principle to show that \(C_n\) is the number of trajectories \((x_0,\dots, x_{2n})\) of a random walk with time horizon \(T=2n\) such that \(x_k \geq 0\), for all \(k\in\{0,1,\dots, 2n\}\) and \(x_{2n}=0\).
Prove Segner’s recurrence formula \(C_{n+1} = \sum_{i=0}^n C_{i} C_{n-i}\).
Show that \(C_n\) is the number of ways the vertices of a regular \(2n\)-gon can be paired so that the line segments joining paired vertices do not intersect.
Let \(\{X_n\}_{n\in {\mathbb{N}}_0}\) be a simple symmetric random walk. Given \(n\in{\mathbb{N}}\), what is the probability that \(X\) does not visit \(0\) during the time interval \(1,\dots, n\)?
Let \(\tau_{-1}\) be the hitting time of the level \({-1}\) for a simple biased random walk \(\{X_n\}_{n\in {\mathbb{N}}_0}\). Choose the correct answer(s) (they will depend on the value of the parameter \(p\)):
Hitting the level \(-1\) for a biased random walk with parameter \(p\) is equivalent to hitting the level \(1\) for a biased random walk with parameter \(1-p\). The correct answers are: a and d for \(p<1/2\); c for \(p=1/2\); and b for \(p>1/2\).
Let \(\tau\) and \(\tilde{\tau}\) be two stopping times. Which of the following are necessarily stopping times, as well:
Either one of the following \(4\) random times is not a stopping time for a simple random walk \(\{X_n\}_{n\in {\mathbb{N}}_0}\), or they all are. Choose the one which is not in the first case, or choose e. if you think they all are.
The correct answer is e. The first, second, third, or … hitting times of a level are stopping times, and so are their minima or maxima. Note that for two stopping times \(\tau_1\) and \(\tau_2\), the one that happens first is \(\min(\tau_1,\tau_2)\) and the one that happens last is \(\max(\tau_1,\tau_2)\).
At most one of the following \(4\) random times is a stopping time for a simple random walk \(\{X_n\}_{n\in {\mathbb{N}}_0}\). Either choose the one which you think is a stopping time, or choose e. if you think there are no stopping times among them.
The correct answer is e. For each of the random times in a.-d. you need to know something about the future to tell whether they have happened by a given time. For example, for c., at time \(2\) you have no way of knowing (in general) whether or not \(X_{2} - X_1\) equals \(X_3\), since \(X_3\) has not been observed yet.
it may interfere with your existing installation↩︎
be careful, though. The expression x = y is not the same as x == y. It does not return a logical value - it assigns the value of y to x↩︎
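A tiny illustration of the difference (the variable names are arbitrary):

```r
x <- 5
y <- 7
x == y   # FALSE: a comparison, returning a logical value
x = y    # an assignment: x is now 7, and no logical value is produced
x        # 7
```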
There are infinitely many ways random variables can be distributed. Indeed, in the discrete \({\mathbb N}\)-valued case only, any sequence of nonnegative numbers \((p_n)_n\) such that \(\sum_n p_n=1\) defines a probability distribution. It turns out, however, that a small-ish number of distributions appear in nature much more often than the rest. These distributions, like the normal, uniform, exponential, binomial, etc. turn out to be so important that they each get a name (hence named distributions). ↩︎
Some books will define the geometric random variable as the number of tosses (and not Ts) before the first H is obtained. In that case, the final H is included in the count. Clearly, this definition and the one we have given differ by \(1\), and this is really not a big deal, but you have to be careful about what is meant when a geometric random variable is mentioned.↩︎
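R itself follows the same convention as these notes: rgeom returns the number of failures (Ts) before the first success (H). A quick sanity check (the sample size is illustrative):

```r
# For the "failures before the first success" convention, the mean is (1-p)/p.
set.seed(1)
p <- 0.5
draws <- rgeom(100000, prob = p)
mean(draws)   # close to (1 - p) / p = 1
min(draws)    # 0 is a possible value: the very first toss can be H
```

Under the other convention (final H included), the smallest possible value would be \(1\) and the mean would be \(1/p\).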
The function sum adds up all the components of the vector. You would not want such a function to be vectorized. If it were, it would return exactly the same vector it got as input.↩︎
It is somewhat unfortunate that the standard notation for the time horizon, namely \(T\), coincides with a shortcut T for TRUE in R. Our example still works fine because this shortcut is used only if there is no variable named T.↩︎
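The shadowing described above is easy to demonstrate:

```r
isTRUE(T)   # TRUE: by default, T is a shortcut for TRUE
T <- 100    # but T can be masked by an ordinary variable
T + 1       # 101
rm(T)       # removing the variable restores the shortcut
isTRUE(T)   # TRUE again
```

This is one reason many style guides recommend always spelling out TRUE and FALSE in code.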
The function apply is often used as a substitute for a for loop because it has several advantages over it. First, the code is much easier to read and understand. Second, apply can easily be parallelized. Third, while this is not such a big issue anymore, for loops used to be orders of magnitude slower than the corresponding apply in the past. R’s for loops got much better recently, but they still lag behind apply in some cases. To be fair, apply is known to use more memory than for in certain cases.↩︎
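As a small illustration of the substitution, here is the same (artificial) computation written both ways, using sapply, a member of the apply family:

```r
# Square the numbers 1 through 5: once with a for loop, once with sapply.
squares_loop <- numeric(5)
for (i in 1:5) {
  squares_loop[i] <- i^2
}
squares_apply <- sapply(1:5, function(i) i^2)
identical(squares_loop, squares_apply)  # TRUE
```

The sapply version needs no pre-allocated output vector and no loop index bookkeeping, which is where the readability advantage comes from.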